These data were taken from this paper in mSystems, investigating the link between soil physico-chemical properties, vegetation, and bacterial community structure.
Abstract:
Global deserts occupy one-third of the Earth’s surface and contribute significantly to organic carbon storage, a process at risk in dryland ecosystems that are highly vulnerable to climate-driven ecosystem degradation. The forces controlling desert ecosystem degradation rates are poorly understood, particularly with respect to the relevance of the arid-soil microbiome. Here we document correlations between increasing aridity and soil bacterial and archaeal microbiome composition along arid to hyperarid transects traversing the Atacama Desert, Chile. A meta-analysis reveals that Atacama soil microbiomes exhibit a gradient in composition, are distinct from a broad cross-section of nondesert soils, and yet are similar to three deserts from different continents. Community richness and diversity were significantly positively correlated with soil relative humidity (SoilRH). Phylogenetic composition was strongly correlated with SoilRH, temperature, and electrical conductivity. The strongest and most significant correlations between SoilRH and phylum relative abundance were observed for Acidobacteria, Proteobacteria, Planctomycetes, Verrucomicrobia, and Euryarchaeota (Spearman’s rank correlation [rs] = >0.81; false-discovery rate [q] = ≤0.005), characterized by 10- to 300-fold decreases in the relative abundance of each taxon. In addition, network analysis revealed a deterioration in the density of significant associations between taxa along the arid to hyperarid gradient, a pattern that may compromise the resilience of hyperarid communities because they lack properties associated with communities that are more integrated. In summary, results suggest that arid-soil microbiome stability is sensitive to aridity as demonstrated by decreased community connectivity associated with the transition from the arid class to the hyperarid class and the significant correlations observed between soilRH and both diversity and the relative abundances of key microbial phyla typically dominant in global soils.
.
Load the packages we’ll need and tell R to use multiple cores where available
####### Parallel computation
#options (mc.cores=parallel::detectCores ())
## Microbiome Tools
library(phyloseq)
library(vegan)
library(microbiome)
#install.packages("devtools")
#devtools::install_github("jbisanz/qiime2R")
library(qiime2R) #for importing QIIME artifacts into phyloseq
library(DESeq2)
library(labdsv)
library(ALDEx2)
##Plotting
library(ggplot2)
library(RColorBrewer) #pretty colour pallettes
library(gplots) #venn diagrams
library(cowplot) #saving graphs
library(pheatmap) #alternative heatmaps
#Data Manipulation
library(tidyverse)
library(MicrobeDS)
#Stats Models
library(lme4)
library(MuMIn)
library(car)
Quick access options for output graphics to make legends and axis labels larger / more legible
#Global Plot Options
plotopts<- theme(axis.text=element_text(size=20),axis.title=element_text(size=22),strip.text=element_text(size=22),legend.title = element_text(size=20),legend.text = element_text(size=20))
Now that we’ve done that, we can read in the files we’ve just copied over. The package QIIME2R has a built in function to import the artefacts we built in QIIME into a format that can be read by the package phyloseq that we’ll use for a lot of our data manipulation and statistics. We tell the function which of the four files correspond to each of the ASV table, taxonomy of those ASVs, sample metadata and the phylogenetic tree of sequences.
### Build Phyloseq Object
physeq<-qza_to_phyloseq("table_16S.qza","rooted-tree.qza","taxonomy_16S_SKLEARN.qza", "sample-metadata.tsv")
Once we’ve made the object, we can ‘inspect it’ by typing its name:
physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1099 taxa and 54 samples ]
## sample_data() Sample Data: [ 54 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 1099 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1099 tips and 1095 internal nodes ]
This tells us what’s contained in each of the 4 elements of the phyloseq object. Crucially we can see that we have 54 samples, and have identified ~ 1100 microbial taxa (the otu_table() line). In addition, the sample_data() line tells us we have 22 sample variables in our metadata file. Note how the taxonomy, otu, and tree files all have the same number of taxa/tips.
Note that you can omit the tree from this step - PhyloSeq only needs the first three items to build a complete object. It just means you wont be able to calculate UNIFRAC statistics. Test it.
### Build Phyloseq Object
physeq_no_tree<-qza_to_phyloseq(features="table_16S.qza",taxonomy="taxonomy_16S_SKLEARN.qza", metadata="sample-metadata.tsv")
physeq_no_tree
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1099 taxa and 54 samples ]
## sample_data() Sample Data: [ 54 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 1099 taxa by 7 taxonomic ranks ]
You can access each of the different elements of the phyloseq object using the commands as follows:
Sample Data: sample_data(physeq)
Taxonomy: tax_table(physeq)
ASV Matrix otu_table(physeq)
Note how even though we’re using ASVs (inferred by DADA2 called in QIIME), phyloseq still refers to them as OTUs in the abundance matrix. Let’s test this out on the taxonomy element of our phyloseq object
head(tax_table(physeq),20)
## Taxonomy Table: [20 taxa by 7 taxonomic ranks]:
## Kingdom Phylum
## c2fc985801922717514f78cfae22c41f "d__Archaea" "Thermoplasmatota"
## b56103557f27fd289b1425d7c527e5ff "d__Archaea" "Thermoplasmatota"
## 744b943584ceaf276ae4ad181666daa9 "d__Archaea" "Thermoplasmatota"
## c9e32079e7c3e2a804a14ce56dfdcc05 "d__Archaea" "Thermoplasmatota"
## dfe7e74bb1e45ea73a95e1de0f813fb3 "d__Archaea" "Thermoplasmatota"
## f5f682d559c7a76fe77e9af2daf2ad4c "d__Archaea" "Thermoplasmatota"
## d8ffb44ef64f7589f69f8ade1a4f79d8 "d__Archaea" "Thermoplasmatota"
## f48560eb73e52267345d8ff3e31e1791 "d__Archaea" "Thermoplasmatota"
## cf400229268468ae928415e072520baf "d__Archaea" "Thermoplasmatota"
## 1d1837728dace68b39ad4eab8a996e66 "d__Archaea" "Thermoplasmatota"
## c2252c3545754ab07a79ca051971f458 "d__Archaea" "Nanoarchaeota"
## f7f5d6bf9b30ab4087d623c77c5e6505 "d__Archaea" "Crenarchaeota"
## 3fd9f5654f31b22598ba0028a8bdbe0d "d__Archaea" "Crenarchaeota"
## a56b903521b8ab33a7878b35582803a8 "d__Archaea" "Crenarchaeota"
## ad752f310a6e3164bdaad203b9c0ed38 "d__Archaea" "Crenarchaeota"
## 470e5cda86f57a2e32f06633fc253fef "d__Archaea" "Crenarchaeota"
## 7213f7039e323dbbbb58f197b116a0d0 "d__Archaea" "Crenarchaeota"
## 7c64d5a7549d568c258e1a26a3cc0aa5 "d__Archaea" "Crenarchaeota"
## ce857a8a5d2851f376e613e27ec4f5e1 "d__Archaea" "Crenarchaeota"
## 0b533f8dfe4baaa9ffe85c19be904d23 "d__Archaea" "Crenarchaeota"
## Class Order
## c2fc985801922717514f78cfae22c41f "Thermoplasmata" "Marine_Group_II"
## b56103557f27fd289b1425d7c527e5ff "Thermoplasmata" "Marine_Group_II"
## 744b943584ceaf276ae4ad181666daa9 "Thermoplasmata" "Marine_Group_II"
## c9e32079e7c3e2a804a14ce56dfdcc05 "Thermoplasmata" "Marine_Group_II"
## dfe7e74bb1e45ea73a95e1de0f813fb3 "Thermoplasmata" "Marine_Group_II"
## f5f682d559c7a76fe77e9af2daf2ad4c "Thermoplasmata" "Marine_Group_II"
## d8ffb44ef64f7589f69f8ade1a4f79d8 "Thermoplasmata" "Marine_Group_II"
## f48560eb73e52267345d8ff3e31e1791 "Thermoplasmata" "Marine_Group_II"
## cf400229268468ae928415e072520baf "Thermoplasmata" "Marine_Group_II"
## 1d1837728dace68b39ad4eab8a996e66 "Thermoplasmata" NA
## c2252c3545754ab07a79ca051971f458 "Nanoarchaeia" "Woesearchaeales"
## f7f5d6bf9b30ab4087d623c77c5e6505 "Nitrososphaeria" "Nitrososphaerales"
## 3fd9f5654f31b22598ba0028a8bdbe0d "Nitrososphaeria" "Nitrososphaerales"
## a56b903521b8ab33a7878b35582803a8 "Nitrososphaeria" "Nitrososphaerales"
## ad752f310a6e3164bdaad203b9c0ed38 "Nitrososphaeria" "Nitrososphaerales"
## 470e5cda86f57a2e32f06633fc253fef "Nitrososphaeria" "Nitrososphaerales"
## 7213f7039e323dbbbb58f197b116a0d0 "Nitrososphaeria" "Nitrososphaerales"
## 7c64d5a7549d568c258e1a26a3cc0aa5 "Nitrososphaeria" "Nitrososphaerales"
## ce857a8a5d2851f376e613e27ec4f5e1 "Nitrososphaeria" "Nitrososphaerales"
## 0b533f8dfe4baaa9ffe85c19be904d23 "Nitrososphaeria" "Nitrososphaerales"
## Family
## c2fc985801922717514f78cfae22c41f "Marine_Group_II"
## b56103557f27fd289b1425d7c527e5ff "Marine_Group_II"
## 744b943584ceaf276ae4ad181666daa9 "Marine_Group_II"
## c9e32079e7c3e2a804a14ce56dfdcc05 "Marine_Group_II"
## dfe7e74bb1e45ea73a95e1de0f813fb3 "Marine_Group_II"
## f5f682d559c7a76fe77e9af2daf2ad4c "Marine_Group_II"
## d8ffb44ef64f7589f69f8ade1a4f79d8 "Marine_Group_II"
## f48560eb73e52267345d8ff3e31e1791 "Marine_Group_II"
## cf400229268468ae928415e072520baf "Marine_Group_II"
## 1d1837728dace68b39ad4eab8a996e66 NA
## c2252c3545754ab07a79ca051971f458 "GW2011_GWC1_47_15"
## f7f5d6bf9b30ab4087d623c77c5e6505 "Nitrososphaeraceae"
## 3fd9f5654f31b22598ba0028a8bdbe0d "Nitrososphaeraceae"
## a56b903521b8ab33a7878b35582803a8 "Nitrososphaeraceae"
## ad752f310a6e3164bdaad203b9c0ed38 "Nitrososphaeraceae"
## 470e5cda86f57a2e32f06633fc253fef "Nitrososphaeraceae"
## 7213f7039e323dbbbb58f197b116a0d0 "Nitrososphaeraceae"
## 7c64d5a7549d568c258e1a26a3cc0aa5 "Nitrososphaeraceae"
## ce857a8a5d2851f376e613e27ec4f5e1 "Nitrososphaeraceae"
## 0b533f8dfe4baaa9ffe85c19be904d23 "Nitrososphaeraceae"
## Genus
## c2fc985801922717514f78cfae22c41f "Marine_Group_II"
## b56103557f27fd289b1425d7c527e5ff "Marine_Group_II"
## 744b943584ceaf276ae4ad181666daa9 "Marine_Group_II"
## c9e32079e7c3e2a804a14ce56dfdcc05 "Marine_Group_II"
## dfe7e74bb1e45ea73a95e1de0f813fb3 "Marine_Group_II"
## f5f682d559c7a76fe77e9af2daf2ad4c "Marine_Group_II"
## d8ffb44ef64f7589f69f8ade1a4f79d8 "Marine_Group_II"
## f48560eb73e52267345d8ff3e31e1791 "Marine_Group_II"
## cf400229268468ae928415e072520baf "Marine_Group_II"
## 1d1837728dace68b39ad4eab8a996e66 NA
## c2252c3545754ab07a79ca051971f458 "GW2011_GWC1_47_15"
## f7f5d6bf9b30ab4087d623c77c5e6505 "Candidatus_Nitrososphaera"
## 3fd9f5654f31b22598ba0028a8bdbe0d "Candidatus_Nitrososphaera"
## a56b903521b8ab33a7878b35582803a8 "Candidatus_Nitrososphaera"
## ad752f310a6e3164bdaad203b9c0ed38 "Candidatus_Nitrososphaera"
## 470e5cda86f57a2e32f06633fc253fef "Candidatus_Nitrososphaera"
## 7213f7039e323dbbbb58f197b116a0d0 "Nitrososphaeraceae"
## 7c64d5a7549d568c258e1a26a3cc0aa5 "Nitrososphaeraceae"
## ce857a8a5d2851f376e613e27ec4f5e1 "Nitrososphaeraceae"
## 0b533f8dfe4baaa9ffe85c19be904d23 "Nitrososphaeraceae"
## Species
## c2fc985801922717514f78cfae22c41f "uncultured_archaeon"
## b56103557f27fd289b1425d7c527e5ff "uncultured_archaeon"
## 744b943584ceaf276ae4ad181666daa9 "uncultured_archaeon"
## c9e32079e7c3e2a804a14ce56dfdcc05 "uncultured_archaeon"
## dfe7e74bb1e45ea73a95e1de0f813fb3 "uncultured_archaeon"
## f5f682d559c7a76fe77e9af2daf2ad4c NA
## d8ffb44ef64f7589f69f8ade1a4f79d8 "uncultured_archaeon"
## f48560eb73e52267345d8ff3e31e1791 "uncultured_archaeon"
## cf400229268468ae928415e072520baf "uncultured_haloarchaeon"
## 1d1837728dace68b39ad4eab8a996e66 NA
## c2252c3545754ab07a79ca051971f458 "uncultured_archaeon"
## f7f5d6bf9b30ab4087d623c77c5e6505 NA
## 3fd9f5654f31b22598ba0028a8bdbe0d "unidentified_archaeon"
## a56b903521b8ab33a7878b35582803a8 "unidentified_archaeon"
## ad752f310a6e3164bdaad203b9c0ed38 "unidentified_archaeon"
## 470e5cda86f57a2e32f06633fc253fef "unidentified_archaeon"
## 7213f7039e323dbbbb58f197b116a0d0 "uncultured_thaumarchaeote"
## 7c64d5a7549d568c258e1a26a3cc0aa5 NA
## ce857a8a5d2851f376e613e27ec4f5e1 NA
## 0b533f8dfe4baaa9ffe85c19be904d23 "metagenome"
We’ve asked for the top 20 records from our taxonomy file. You can see that the columns are named in a hierarchical fashion from Kingdom to Species. You will also notice that there’s variation in the precision with which we have been able to assign taxonomy. Some ASVs can only be assigned to the Bacterial Kingdom. One appear to be have been assigned to species level.
There seem to be a lot of Archaea in the top of this dataset. We can ask hw many of our ASVs belong to different phyla with a quick tabulate command
table(tax_table(physeq)[,1])
##
## d__Archaea d__Bacteria
## 32 1067
We can ask what proportion of assignments we have made at each level with a quick line of code:
apply(tax_table(physeq)[,2:7],2,function(x){1-mean(is.na(x))})
## Phylum Class Order Family Genus Species
## 0.9927207 0.9863512 0.9772520 0.9681529 0.9080983 0.4868062
Roughly 99% to Phylum level. The figures for Genus and Species level are a bit misleading, because a lot of species are assigned as ‘unindentified’ if you’ve used the SILVA database to assign taxonomy. Overall though, it’s not surprising that precision decreases with more fine-scale taxonomy. Remember we have only sequenced a couple of hundred bases of a highly conserved marker gene. That we are able to assign ANY sequences to species is remarkable.
Assuming we want to restrict our analyses to Bacteria, we can filter out taxa assigned to other Phyla like Archaea. FOr this we evaluate whether the 1st column of the taxonomy table - the Phylum - has the entry ‘Bacteria’, and if so to keep it.
physeq_bacteria<-prune_taxa(as.logical(tax_table(physeq)[,1]=="d__Bacteria"),physeq)
physeq_bacteria
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1067 taxa and 54 samples ]
## sample_data() Sample Data: [ 54 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 1067 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1067 tips and 1063 internal nodes ]
We can see that we have lost some taxa here
ntaxa(physeq); ntaxa(physeq_bacteria)
## [1] 1099
## [1] 1067
Now let’s look at the metadata file using the same code:
colnames(sample_data(physeq_bacteria))
## [1] "barcode.sequence" "elevation"
## [3] "extract.concen" "amplicon.concentration"
## [5] "extract.group.no" "transect.name"
## [7] "site.name" "depth"
## [9] "ph" "toc"
## [11] "ec" "average.soil.relative.humidity"
## [13] "relative.humidity.soil.high" "relative.humidity.soil.low"
## [15] "percent.relative.humidity.soil.100" "average.soil.temperature"
## [17] "temperature.soil.high" "temperature.soil.low"
## [19] "vegetation" "percentcover"
We can see that these metadata are a mix of those associated with sample preparation (BarcodeSequence, extraction concentration etc.), sampling design (Site and Transect Name), and biological variables like pH, humidity and vegetation characteristics.
For this tutorial, we haven’t been provided with any samples that represent positive controls - mock communities of artificially assembled bacteria and so of known composition/ species diversity that we can use to benchmark our assessment of per-sample diversity. You should always include positive controls in your experimental design.
Likewise there are no negative controls, which would allow us to what sequences have cropped up in the negatives controls to get a handle on the likely levels of background contamination. This could include microbes living in the salt-rich PCR mastermix, or the extraction kit, or background contamination in the lab where you processed your samples.
It’s worth noting here that how to deal with negative contamination is a contentious issue. One strategy is to simply remove any sequence that popped up in the negative controls from ALL samples. This is intuitive and has the advantage of being objective, but sometimes you can pick up extremely low abundance sequences that are highly abundant in other samples - suggesting some nefarious aerosol of DNA contaminated the negative from a real sample. Noah Fierer and co. have written a blog on sources of contamination and how to deal with them
Also be sure to check out the R package DECONTAM
We might want to subset the data to only reads that appear in our dataset with a given frequency. This can make our lives simpler by reducing computation time for downstream steps, and also reduce noise caused by low-abundance sequences with low coverage where can be less certain of their relative abundance. We could do this many ways, including only selecting:
1) Seqs that appear in at least X% of samples
2) Seqs that appear at a frequency of Y% in the entire dataset
3) Seqs that pass a certain raw reads threshold, i.e. must appear at least Z times in the dataset.
Here, we’ll subset to only sequences with a least 100 reads:
# Removing OTUs with fewer than 100 reads
physeq100<-prune_taxa(taxa_sums(physeq_bacteria)>100,physeq_bacteria)
#How Many Taxa Did We Lose?
ntaxa(physeq100); ntaxa(physeq_bacteria)
## [1] 123
## [1] 1067
Now let’s compute exactly how may taxa we lost.
ntaxa(physeq_bacteria)- ntaxa(physeq100)
## [1] 944
Now that we’ve done that, we can check what our post-QC library sizes are. It’s a good idea to report these in manuscripts and thesis chapters. Here was ask what the minimum, maximum and mean library sizes per sample are.
<br>
############## POST QC LIBRARY SIZES
###############
# Post-QC Library Stats
###############
mean(sample_sums(physeq100))
## [1] 692.2037
range(sample_sums(physeq100))
## [1] 17 1799
We might be interested in checking that our per-group coverage is roughly equal. Lack of equal coverage might occur if a particular subset of samples amplify poorly and are hard to pool at the same concentration as other samples. They might be older samples, or from a different sampling location, that systematically exhibit lower coverage. We’ve had reviewers ask for these checks before, so it might be wise to do them as a matter of course.
Let’s plot reads as a function of site ID
#Make a data frame of read depths
soil_reads<-data.frame(reads=sample_sums(physeq100))
#Add on the sample ID
soil_reads$Sample<-rownames(soil_reads)
#Extract the Metadata from the phyloseq object
soil_meta<-data.frame(sample_data(physeq))
soil_meta$Sample<-rownames(soil_meta)
#Join on the Metadata
soil_reads<-left_join(soil_reads,soil_meta,"Sample")
#Some Boxplots of COverage by Population using ggplot2
#library(ggplot2)
ggplot(soil_reads,aes(x=site.name,y=reads)) + geom_boxplot(aes(fill=site.name)) + plotopts + theme(axis.text.x = element_text(angle = 45, hjust=1))
We need to interpret these with a bit of caution, as this a massively subsetted dataset with only a few samples per site, so estimates of among-site variability might be unreliable. Your own datasets will vary.
Adapt the code above to plot the range of sample reads for each of the two ‘Vegetation’ categories
Rarefaction curves allow us to assess whether we’ve likely discovered all the microbial ‘species’ / ASVs in a sample. The lower the true species diversity, and the higher the sampling depth (number of reads), the more likely we are to have saturated our species discovery curves. We can plot per-sample rates of species discovery using the ‘rarecurve’ function in vegan. By default this will plot a different line for each sample, so if you have hundreds of samples, these graphs can look very cluttered!
rarecurve(t(otu_table(physeq_bacteria)), step=50, cex=0.5)
Ok - now we’re ready to do some calculations. But first, we need to remove any artefacts caused by uneven sampling depth across samples, using the command rarefy_even_depth.
This command will automatically rarefy to the lowest-coverage sample sampling depth if you do not specify a sampling depth. This means you won’t lose any samples, but could potentially throw away lots of data if there’s a single library that amplified poorly.
What is super important is that you specify a random number seed. This is because the rarefying step is based on a random subsample of each library, and providing a random number seed means your analysis will be completely reproducible (rather than differing every time because of sampling error).
Note that rarefying libraries is quite a contentious issue. See this paper here for an explanation.
####### Rarefying Samples
#remind ourselves what the minimum coverage is
min(sample_sums(physeq))
## [1] 27
This is because there are some samples here that didn’t sequence very well at all - or lost a lot of reads in the QC steps in the bioinformatic workflow - or both. We can query our data and ask which samples have fewer than 500 reads
names(sample_sums(physeq100))[which(sample_sums(physeq100)<500)]
## [1] "BAQ2420.1.1" "BAQ2420.1.3" "BAQ2420.3" "BAQ2462.3" "BAQ2687.1"
## [6] "BAQ2687.3" "BAQ2838.1" "BAQ2838.2" "BAQ2838.3" "BAQ4166.1.1"
## [11] "YUN3259.1.1" "YUN3259.1.3" "YUN3259.2" "YUN3346.2" "YUN3533.1.2"
## [16] "YUN3856.2"
And how many there were in total
length(sample_sums(physeq_bacteria)[which(sample_sums(physeq_bacteria)<500)])
## [1] 5
So if we choose 500 as our sampling depth for rarefying, we will lose these samples as they don’t satisfy our minimum depth criterion. We can do that now. It’s very important that we we set a random number seed so that the workflow is reproducible:
# Rarefy to a custom minimum library size, set random number see for reproducibility
ps_rare<-rarefy_even_depth(physeq_bacteria,500,rngseed = 150517)
## `set.seed(150517)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(150517); .Random.seed` for the full vector
## ...
## 5 samples removedbecause they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## BAQ2838.2BAQ2838.3YUN3259.1.1YUN3259.1.3YUN3346.2
## ...
## 53OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
Note how we get helpful warnings about which samples we lost, and also which sequences we lost because they were found only in those samples. Again this demonstrates how phyloseq operates on the whole phyloseq object and how all the tables within an object are linked.
Much like using ‘subset’ on dataframes in R, phyloseq has some built in functions for manipulating phyloseq objects to select only certain samples, or sequences (taxa/ASVs) of interest.
Here we’ll make use of the prune_samples command that allows us to subset our phyloseq object to only certain sample of interest meeting a certain condition.
The general syntax is: prune_samples(condition to select, phyloseq object to use)
Subset Our Phyloseq Object to Only Samples Containing Vegetation Cover
#Prune
veg_yes<-prune_samples(sample_data(ps_rare)$vegetation =="yes",ps_rare)
Now we can check we’ve done what we think we’ve done.
#Inspect
veg_yes
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1014 taxa and 32 samples ]
## sample_data() Sample Data: [ 32 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 1014 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1014 tips and 1010 internal nodes ]
table(sample_data(veg_yes)$vegetation)
##
## yes
## 32
Adapt the code above to subset the data to only those records where Average Relative Humidity is less than 80
We can make the above code without the complex loop, if that helps to see what the code is doing. Hopefully the answer here will be the same as the one above!
###### LOOP FREE VENN DIAGRAM CREATION
## repeat for each condition you want to look at
#Subset Based On Condition
veg_yes<-prune_samples(sample_data(ps_rare)$vegetation =="yes",ps_rare)
#Work out which ASVs are present (>0 abundance)
veg_yes_keep<-apply(otu_table(veg_yes),1,function(x){sum(x)>0})
#Subset to Only Those ASVS
veg_yes_sub<-prune_taxa(veg_yes_keep,veg_yes)
#Strip Out the Rownames of the Subsetted Object
veg_yes_asvs<-rownames(otu_table(veg_yes_sub))
head(veg_yes_asvs)
## [1] "99fa718995b3129c258bd12bdabb4bec" "2dcce6012bdc7a9cb15900ee79beb2ed"
## [3] "8133d2b0b8fcb93ae67ebaaec9667e7b" "996b8c93a5184a88d3ec0d3965fe31b2"
## [5] "d183373f95d7758e5883ac24de921410" "3a51360d9283ba62bef12229ba447c63"
#Same as above for No
veg_no<-prune_samples(sample_data(ps_rare)$vegetation =="no",ps_rare)
veg_no_keep<-apply(otu_table(veg_no),1,function(x){sum(x)>0})
veg_no_sub<-prune_taxa(veg_no_keep,veg_no)
veg_no_asvs<-rownames(otu_table(veg_no_sub))
#Make a List for the Venn Diagram
veg_list<-list(yes=veg_yes_asvs,no=veg_no_asvs)
#Make the Venn diagram
library(gplots)
venn(veg_list)
Alpha diversity generally corresponds to overall species (sequence) richness in a sample. Some alpha diversity metrics account for overall richness but also how those sequences are distributed among the samples (evenness). It’s important to note that these metrics will all be correlated because they all measure similar things. Some papers will report tests on multiple alpha diversity values, but this arguably exposes them to multiple testing issues - that is increased risk of finding a false positive by asking the same statistical question over and over.
Calculate Alpha Diversity for Samples
### Estimate lots of alpha diversity measures (note Chao1 should be ignored with DADA2 as Chao1 is based on frequency of singletons, but DADA2 doesnt allow singletons)
soil_richness<-estimate_richness(ps_rare,measures=c("Observed","Shannon","InvSimpson"))
head(soil_richness)
## Observed Shannon InvSimpson
## BAQ2420.1.1 23 2.756977 12.63903
## BAQ2420.1.2 26 2.921896 15.31863
## BAQ2420.1.3 36 3.248181 20.08677
## BAQ2420.2 33 3.319443 24.39500
## BAQ2420.3 36 3.343809 24.84596
## BAQ2462.1 34 3.093297 14.04968
Proof That Alpha Diversity Metrics are Correlated
with(soil_richness,plot(Shannon,InvSimpson))
We can inspect the spread of our richness measures here:
subset(soil_richness,Observed<5)
## [1] Observed Shannon InvSimpson
## <0 rows> (or 0-length row.names)
ggplot(soil_richness,aes(x=Observed)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Rarefying is a random subsample of our reads to equalise read depth, and from these we can calculate traits of interest like ASV-richness. But are we worried that the answer will change depending on that single random subsample? Can we fix this by repeating the rarefying multiple times to calculate our traits of interest? Yes we can!
Bad news is we will be using loops to do this. For each iteration, we will: i) rarefy, ii) calculate and store a **per-sample* alpha diversity value iii) store the values to perform calculations on later
## How Many Times Should We Rarefy?
n_rarefy_sims<-100
#What Should Our Rarefying Threshold Be?
rarefy_depth<- 500
#How Many Samples Are We Expecting?
nsamples_permute<-length(sample_sums(physeq_bacteria)[which(sample_sums(physeq_bacteria)>=rarefy_depth)])
#An empty matrix to store our data
richvals_matrix<-matrix(NA,ncol=n_rarefy_sims,nrow=nsamples_permute)
##### RUN THE LOOP
for (k in 1:n_rarefy_sims){
#New Rarefying Step - no random number seed or all results will be the same
ps_rare_temp<-rarefy_even_depth(physeq_bacteria,rarefy_depth)
#Calculate a Trait - we'll use Shannon Diversity
richvals_matrix[,k]<-as.matrix(estimate_richness(ps_rare_temp,measures="Shannon"))[,1]
}
We’ve now made a matrix where each row represents error in the estimation of a trait of interest due to rarefying. It might be useful to start by visualizing this per-sample uncertainty
#Name the Samples
rownames(richvals_matrix)<- sample_names(rarefy_even_depth(physeq_bacteria,rarefy_depth))
## You set `rngseed` to FALSE. Make sure you've set & recorded
## the random seed of your session for reproducibility.
## See `?set.seed`
## ...
## 5 samples removedbecause they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## BAQ2838.2BAQ2838.3YUN3259.1.1YUN3259.1.3YUN3346.2
## ...
## 62OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
#Format FOr Plotting
t(richvals_matrix) %>% as.data.frame() %>% pivot_longer(everything(),names_to = "sample",values_to = "shannon") -> rarefy_permute_plotdata
##Plot
ggplot(rarefy_permute_plotdata,aes(x=sample,y=shannon)) + geom_violin(aes(fill=sample)) + theme(legend.position="none") + plotopts + theme(axis.text.x = element_text(angle = 45, hjust=1,size=10))
Hopefully you can see from this that variation AMONG SAMPLES (mostly biological variation) is far greater than variation WITHIN SAMPLES introduced by our choice of rarefying threshold (statistical variation).
We can also plot average value produced from these permutations against the single-step values from above
plot(apply(richvals_matrix,1,mean), as.numeric(estimate_richness(ps_rare,measures="Shannon")[,1]))
abline(0,1)
This is encouraging, though you might notice some deviations from the expected 1:1 line
Repeat the above exercise but change the rarefying threshold to your choice of value. Smaller values will likely work better for this dataset
Phyloseq has built in plotting functions for alpha diversity, where you can choose the metric you wish to plot. Note you don’t have to have used ‘estimate_richness’ first for this to work. We can use the ‘x=’ argument to specify how to group the alpha diverstiy metrics. First we can just plot the Shannon diversity for all samples separately:
## Make a plot of our chosen richness estimator by individual sample
plot_richness(ps_rare,x="samples",measures=c("Shannon"))
But this isn’t very useful. More interesting would be to see if alpha diversity varied by group, in this case good old Vegetation
## Make a plot of our chosen richness estimator by population
plot_richness(ps_rare,x="vegetation",measures=c("Shannon"))
Plot the alpha diversity metric of your choice by site and transect.
But we might want to plot the relationship between alpha diversity and other variables, or even do the same but controlling for a third variable. Here we plot the relationship between alpha diversity and Elevation.
Because phyloseq calls ggplot, you can add ggplot customisation layers to the plot that specify aesthetics like plotting character, size and fill, and axis text sizes
plot_richness(ps_rare,x="elevation",measures=c("Shannon")) + geom_point(size=5,pch=21,fill="white") + theme_classic() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
Definitely looks like there’s something going on here, which we’ll explore formally later.
We might want to do the same, but manipulate the fill colour of the points depending on a third variable - in this case Vegetation status
plot_richness(ps_rare,x="elevation",measures=c("Shannon")) + geom_point(size=5,pch=21,aes(fill=vegetation)) + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
Note that we can also achieve the above manually by extracting our richness estimates created with the ‘estimate_richness’ function and tacking on our metadata, followed by simular plotting arguments. First we add a column called ‘Sample’ to the richness dataframe to be used as an index for joining the two dataframes.
#Add Richness Onto Our Metadata
soil_richness$Sample<-rownames(soil_richness)
soil_richness<-left_join(soil_richness,soil_meta,"Sample")
#Plot the correlation between heterozygosity and richness (shannon diversity)
ggplot(soil_richness,aes(x=elevation,y=Shannon)) + geom_point(size=5,pch=21,fill="white") + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
Looks like there’s something going on. How about we ask if there’s any correlation between alpha diversity and elevation
with(soil_richness,cor.test(elevation,Shannon))
##
## Pearson's product-moment correlation
##
## data: elevation and Shannon
## t = 4.4477, df = 47, p-value = 5.291e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3106001 0.7158941
## sample estimates:
## cor
## 0.5442611
Looks convincing, but what we haven’t done here is account for population effects - Site and Transect. So let’s make the same plot again but identify which Transects individuals come from. Code versions below for plotting directly from phyloseq and from our metadata dataframe.
##### USING OUR COMBINED DATAFRAME
# Same as above, but identify which population the individuals come from
pop1<-ggplot(soil_richness,aes(x=elevation,y=Shannon)) + geom_point(size=5,pch=21,aes(fill=site.name)) + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
pop1
# Same as above, but 'facet' each population to a separate panel
pop1 + facet_wrap(.~transect.name)
##### USING PHYLOSEQ
pop2<-plot_richness(ps_rare,x="elevation",measures=c("Shannon")) + geom_point(size=5,pch=21,aes(fill=site.name)) + theme_bw() + labs(x="Elevation (m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
pop2
# Same as above, but 'facet' each population to a separate panel
pop2 + facet_grid(.~transect.name)
If we don’t like the default colours R gives us, we can choose our own. R Color Brewer is brilliant for this and works directly within its own ggplot call. We can even just use the ‘pop1’ object we’ve already created to change the colours
#Plot the correlation between heterozygosity and richness (shannon diversity)
elev_shannon_default<- ggplot(soil_richness,aes(x=elevation,y=Shannon)) + geom_point(size=5,pch=21,aes(fill=transect.name)) + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
elev_shannon_default
######## Adding Some Custom Plotting Colours
elev_shannon_default + scale_fill_brewer(palette = "Set3")
Have a look at all the R Color Brewer palettes and make the graph look how you want it to. You can replace the palette names directly above e.g.“Set3” etc
display.brewer.all()
The package ‘cowplot’ has some flexible implementations for saving ggplot-style plots at publication quality and in a range of formats. All you need to do is have all your layers saved as one object for export. Let’s say the following is our preferred graph:
pop1 + scale_fill_brewer(palette="Set1")
We can save this colour layer as a new object and export using cowplot’s implementation of ‘ggsave2’. The general syntax is:
ggsave2(file name of output file.FORMAT, saved graph object)
The nice thing is we might want to save it as both a pdf and a tiff file, and simply changing the file format in the filename will do so. If a journal asks for a minimum resolution / dpi, you can specify that too. (see ?ggsave2)
#All Layers Together
pop_custom_colours<-pop1 + scale_fill_brewer(palette="Set1")
#PDF Version
ggsave2('Shannon Diversity by Transect.pdf',pop_custom_colours)
## Saving 7 x 5 in image
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
#TIFF version
ggsave2('Shannon Diversity by Transect.tiff',pop_custom_colours)
## Saving 7 x 5 in image
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
#PDF Version With Custom Size in Inches
ggsave2('Shannon Diversity by Transect LARGE.pdf',pop_custom_colours,width=12,height=10,units="in")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
Next we can look at metrics of community structure - differences among samples in the distribution and relative abundance of our microbial taxa
First, we can visualise these differences using a heatmap, which will show us differential abundances of each ASV within and across samples. First we subset the data to only the 50 most abundant ASVs to make the plot simpler to interpret.
#################### HEATMAP (baked into phyloseq)
#Subset to Most 50 abundant OTUs
ps_rare_top50 <- prune_taxa(names(sort(taxa_sums(ps_rare),TRUE)[1:50]), ps_rare)
Next, we plot the heatmap By default, the heatmap will cluster samples along the x axis based on similarity of total community structure. Here we’d hope that samples cluster by population.
#Plot Heatmap with X axis ordered Inter-Sample Distance
plot_heatmap(ps_rare_top50,"NMDS",distance = "bray")
## Warning: Transformation introduced infinite values in discrete y-axis
They do, but there is some ‘noise’ here. Still, this is convincing evidence that samples within a Transect are more similar to one another in terms of microbial community structure than to samples in different Transects. It also makes it easy to spot the samples seemingly dominated by just one sequence.
If we don’t want that behaviour, we can manually override the clustering and tell it to group samples by a variable of interest, such as river
#Plot Heatmap with X axis ordered by population
plot_heatmap(ps_rare_top50,"NMDS",distance = "bray",sample.label="site.name",sample.order = "site.name")
## Warning: Transformation introduced infinite values in discrete y-axis
There are plenty of heatmap packages in R, and one I quite like is the pheatmap package. It allows you to annotate the heatmaps with useful metadata, like sampling site etc.
Here we’ll plot a heatmap with pheatmap, and annotate samples based on whether there was vegetation on that sample.Note we’re not clustering by column (sample) here, just letting them be plotted in site-order as they appear in the data frame.
#library(pheatmap)
#Generate Metadata for Plotting Colours
veg_data<-data.frame(Vegetation=sample_data(ps_rare_top50)$vegetation)
rownames(veg_data)<-rownames(sample_data(ps_rare_top50))
#Plot Heatmap
pheatmap(otu_table(ps_rare_top50),cluster_cols = FALSE,scale="row",annotation_col = veg_data)
pheatmap(otu_table(ps_rare_top50),cluster_cols = TRUE,scale="row",annotation_col = veg_data)
Modify the Code above to use R Color Brewer to change the plotting colours.
We can summarise our microbial communities using compositional barplots, which provide a quick and accessible way of visualising differences in taxonomy at different hierarchies.
The first step here is to remove samples in our phyloseq object with low reads, otherwise this will break our averaging
#Filter Samples With No Data
physeq_subset<-prune_samples(sample_sums(physeq_bacteria)>0,physeq_bacteria)
Next we tell phyloseq to aggregate reads at the taxonomic level of interest - here Phylum - and to transform the data to relative abundance (fraction of reads, rather than count of reads)
#Aggregate To Phylum Level and Transform to Relative Abundance
physeq_phylum <- physeq_subset %>%
aggregate_taxa(level = "Phylum") %>%
microbiome::transform(transform = "compositional")
Then we can plot the composition for each sample, order by the abundance of the Phylum Firmicutes (you can change this to any phylum present)
#Plot Composition By Sample ID
physeq_sample_plot <- physeq_phylum %>%
plot_composition(sample.sort = "Firmicutes")
physeq_sample_plot
Alternatively, we can group our data by sample metadata of interest - here we can bin all samples from the same transect together.
#Plot Composition Grouped By Sample Metadata
physeq_grouped_plot <- physeq_phylum %>%
plot_composition(sample.sort = "Firmicutes", average_by = "transect.name")
physeq_grouped_plot
These graphs can get messy sometimes, so it’s often useful to subset the data to some of the most abundant members of the taxonomy of interest. Here we’ll calculate the top 5 most abundant phyla in our data, subset the phyloseq object to those data, and remake our graph.
#### Run Again But With Top 'N' Phyla
#What Are the Names of the most abundant phyla?
physeq_phylumcollapse<- physeq_subset %>% aggregate_taxa(level="Phylum")
physeq_top5phyla = names(sort(taxa_sums(physeq_phylumcollapse), TRUE)[1:5])
#Subset the phyloseq object to those phyla
physeq_top5phylum_filter<-subset_taxa(physeq_subset,Phylum %in% physeq_top5phyla)
#Remake Our Graph but with grouping by VEGETATION
physeq_top5phylum_plot <- physeq_top5phylum_filter %>%
aggregate_taxa(level = "Phylum") %>%
microbiome::transform(transform = "compositional") %>%
plot_composition(sample.sort = "Firmicutes", average_by = "vegetation")
physeq_top5phylum_plot
We can also add custom colours to these graphs to make the distinctions between categories easier
physeq_top5phylum_plot + scale_fill_brewer(palette="Set3")
Make a compositional barplot where taxa are aggregated at the FAMILY Level, split by site
Make a compositional barplot where samples are binned into SITES, not transects.
It can be useful to visualize microbial community structure rather than rely on a matrix of numbers. There are many ways to do so, all with jazzy acronyms, including Non-Metric Multidimensional Scaling (NMDS), Principal Coordinates Analysis (PCoA), Constrained Correspondence Analysis (CCA), Detrended Correspondence Analysis (DCA) etc. They each make different assumptions about the data, but they share the same principle of reducing variation in the relative abundances hundreds of microbial taxa into 2 or 3 axes that can then be plotted/visualised, or statistically tested.
For an excellent guide to the pros and cons of all of these methods, see this paper by Paliy & Shankar
Here we’ll use the built in phyloseq functions to perform NMDS ordination on Bray-Curtis distances among samples. We specify ‘k=2’ to state that the variation should be condensed into 2 axes.
As a general rule, the ‘stress’ value of an NMDS ordination should be below 15% / 0.15, and ideally below 10%. Stress is a measure of how ‘easy’ it was to reduce the variation in your microbial abundance matrix into the number of axes you specify. If the stress value is too large, or the model doesn’t converge, then try setting k to 3. If you do this, make sure to state in your methods of your paper that the ordination was done across 3 axes.
#################### ORDINATION WITHIN PHYLOSEQ
### Ordination using built in functions in phyoseq (calls vegan)
ord.nmds.bray <- ordinate(ps_rare, method="NMDS",k=3, distance="bray",trymax=50)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1311407
## Run 1 stress 0.130744
## ... New best solution
## ... Procrustes: rmse 0.08235512 max resid 0.2411586
## Run 2 stress 0.1297148
## ... New best solution
## ... Procrustes: rmse 0.04422977 max resid 0.2710738
## Run 3 stress 0.1255234
## ... New best solution
## ... Procrustes: rmse 0.0919183 max resid 0.2401458
## Run 4 stress 0.1288516
## Run 5 stress 0.1266376
## Run 6 stress 0.1275053
## Run 7 stress 0.1266969
## Run 8 stress 0.1264839
## Run 9 stress 0.1256807
## ... Procrustes: rmse 0.01633202 max resid 0.06366843
## Run 10 stress 0.1280006
## Run 11 stress 0.1265218
## Run 12 stress 0.1281518
## Run 13 stress 0.126604
## Run 14 stress 0.1270444
## Run 15 stress 0.1306928
## Run 16 stress 0.1290179
## Run 17 stress 0.12919
## Run 18 stress 0.1290175
## Run 19 stress 0.1254496
## ... New best solution
## ... Procrustes: rmse 0.005840751 max resid 0.02249395
## Run 20 stress 0.1293103
## Run 21 stress 0.127207
## Run 22 stress 0.1264251
## Run 23 stress 0.1289484
## Run 24 stress 0.1269614
## Run 25 stress 0.1266924
## Run 26 stress 0.1286795
## Run 27 stress 0.1254753
## ... Procrustes: rmse 0.003605075 max resid 0.01789077
## Run 28 stress 0.1264333
## Run 29 stress 0.1293387
## Run 30 stress 0.1254494
## ... New best solution
## ... Procrustes: rmse 0.0006687956 max resid 0.002601191
## ... Similar to previous best
## *** Solution reached
#scores(ord.nmds.bray)
#Make Ordination Plot and Store it
ord1<-plot_ordination(ps_rare, ord.nmds.bray, color="transect.name", title="Bray NMDS")
#Plot with Ellipses assuming normal distribution
ord1 #+ stat_ellipse(type="norm") + theme_bw()
#Same But with Filled Ellipses (note that 'alpha' changes how transluscent the fill is)
ord1 + stat_ellipse(geom="polygon",aes(fill=transect.name),type="norm",alpha=0.4) + theme_bw()
Modify the Code above to use R Color Brewer to change the plotting colours.
Now we’ll do the same using vegan’s equivalent functions. To do so we need to extract the relevant dataframes from our phyloseq object.
#################### ORDINATION USING VEGAN
#### Vegan Function to Convert Phyloseq into Something Vegan can call directly
vegan_otu <- function(physeq) {
OTU <- otu_table(physeq)
if (taxa_are_rows(OTU)) {
OTU <- t(OTU)
}
return(as(OTU, "matrix"))
}
#Load Libraries
#library(vegan)
#library(RColorBrewer)
#Convert OTU table to abundance matrix
soil.v<-vegan_otu(ps_rare)
#Convert Sample Data to
soil.s<-as(sample_data(ps_rare),"data.frame")
###### NMDS Ordination
soil.nmds<-metaMDS(soil.v,k=3,distance="bray",trymax = 50)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1311407
## Run 1 stress 0.1287611
## ... New best solution
## ... Procrustes: rmse 0.08871537 max resid 0.2775393
## Run 2 stress 0.1299215
## Run 3 stress 0.1270068
## ... New best solution
## ... Procrustes: rmse 0.08542 max resid 0.2763549
## Run 4 stress 0.129896
## Run 5 stress 0.1255131
## ... New best solution
## ... Procrustes: rmse 0.04094959 max resid 0.2534407
## Run 6 stress 0.1264248
## Run 7 stress 0.1270382
## Run 8 stress 0.1282543
## Run 9 stress 0.1292836
## Run 10 stress 0.1293186
## Run 11 stress 0.1299762
## Run 12 stress 0.1254493
## ... New best solution
## ... Procrustes: rmse 0.006136808 max resid 0.02140133
## Run 13 stress 0.1292856
## Run 14 stress 0.1292054
## Run 15 stress 0.1291509
## Run 16 stress 0.1267317
## Run 17 stress 0.1308613
## Run 18 stress 0.1306514
## Run 19 stress 0.1274428
## Run 20 stress 0.1264059
## Run 21 stress 0.1265212
## Run 22 stress 0.1292011
## Run 23 stress 0.1284072
## Run 24 stress 0.1301848
## Run 25 stress 0.128199
## Run 26 stress 0.1254496
## ... Procrustes: rmse 0.0007108842 max resid 0.002519635
## ... Similar to previous best
## *** Solution reached
stressplot(soil.nmds)
#scores(soil.nmds)
##################### NMDS Plot using VEGAN
#Select Some Colours from RColorBrewer
ordination_cols<-brewer.pal(5,"Set2")
cols<-data.frame(pop=unique(soil.s$transect.name),col=ordination_cols[1:length(unique(soil.s$transect.name))])
#Expand so that each sample in the database has a colour value
cols.expand<-as.character(cols[,2][match(soil.s$transect.name,cols[,1])])
cols<-cols[order(cols[,1]),]
#Plot Axis Bounds
plot(soil.nmds,display="sites",type="n",las=1,ylab="",xaxt="n",yaxt="n",xlab="")
#Add On A Spider linking each point to its population centroid
ordispider(soil.nmds,soil.s$transect.name,col="blue",label=F)
#Add the data points
with(soil.s,points(soil.nmds,display="sites",pch=21,bg=cols.expand,cex=2))
#Add an ellipse
#'ehull' is ellipsoid hull that encloses all points within a group
#'#?ordiellipse for more options under option 'kind'
ordiellipse(soil.nmds,soil.s$transect.name,col="blue",label=F,kind="ehull")
#Add Axis Labels
mtext(c('NMDS 1', 'NMDS 2'), side=c(1,2), adj=1, cex=1.5,font=2, line=c(1, 0))
with(soil.s,legend("bottomright",legend=cols[,1],pch=21,pt.bg=as.character(cols[,2]),bty="n",pt.cex=2))
### Note You Can play with the axis limits to frame the plot more accurately
PERMANOVA is a randomisation procedure that will test for the effects of predictors of interest in driving differences in beta diversity / community structure. It runs in vegan, so we need the files we converted for vegan. ‘soil.v’ is our ASV abundance matrix, and the ‘data’ argument needs our sample metadata.
The nice thing about permanova is that it provides r2 values for effects as well as p values, so you can get an idea of % of explained variance and ‘importance’ of effects.
######################### TESTING FOR DIFFERENCES WITH PERMANOVA
soil.adonis<-adonis(soil.v ~ factor(transect.name) + vegetation ,data=soil.s,permutations=10000,method="bray")
soil.adonis #pop rsq = ~59% p <0.0001
##
## Call:
## adonis(formula = soil.v ~ factor(transect.name) + vegetation, data = soil.s, permutations = 10000, method = "bray")
##
## Permutation: free
## Number of permutations: 10000
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## factor(transect.name) 1 0.9159 0.91588 2.1552 0.04172 9.999e-05 ***
## vegetation 1 1.4904 1.49038 3.5071 0.06789 9.999e-05 ***
## Residuals 46 19.5480 0.42496 0.89040
## Total 48 21.9543 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
soil.adonis_site<-adonis(soil.v ~ factor(site.name) + factor(transect.name) + vegetation ,data=soil.s,permutations=10000,method="bray")
Transect ID only explains ~4% of our variation in microbial community structure.
Fit a PERMANOVA model investigating the effect of Extraction Batch on microbial community structure
An alternative to Bray-Curtis distances among samples is to use the Unifrac distance. Unifrac takes into account the phylogenetic relatedness amongst community members using a sequence tree built from your sequences. ‘Unweighted’ unifrac simply uses genetic distance, whereas ‘Weighted’ unifrac takes into account the relative abundance of each microbial taxon using the abundance table
######### UNIFRAC ORDINATION PLOTS
## Unweighted Unifrac
### Ordination using built in functions in phyoseq (calls vegan)
ord_nmds_unifrac_unweighted <- ordinate(ps_rare, method="PCoA", distance="unifrac",weighted=FALSE)
## Warning in matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, : data
## length [2023] is not a sub-multiple or multiple of the number of rows [1012]
# #Make Ordination Plot and Store it
ord_unifrace_uw1<-plot_ordination(ps_rare, ord_nmds_unifrac_unweighted, color="transect.name", title="Unweighted Unifrac NMDS")
ord_unifrace_uw1 + stat_ellipse(geom="polygon",aes(fill=transect.name),type="norm",alpha=0.4) + theme_bw()
## Weighted Unifrac
### Ordination using built in functions in phyoseq (calls vegan)
ord_nmds_unifrac_weighted <- ordinate(ps_rare, method="PCoA", distance="unifrac",weighted=TRUE)
## Warning in matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, : data
## length [2023] is not a sub-multiple or multiple of the number of rows [1012]
#Make Ordination Plot and Store it
ord_unifrace_w1<-plot_ordination(ps_rare, ord_nmds_unifrac_weighted, color="transect.name", title="Weighted Unifrac NMDS")
ord_unifrace_w1 + stat_ellipse(geom="polygon",aes(fill=transect.name),type="norm",alpha=0.4) + theme_bw()
library(rbiom)
##
## Attaching package: 'rbiom'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:DESeq2':
##
## counts
## The following object is masked from 'package:S4Vectors':
##
## metadata
## The following object is masked from 'package:BiocGenerics':
##
## counts
## The following object is masked from 'package:vegan':
##
## rarefy
## The following objects are masked from 'package:phyloseq':
##
## nsamples, ntaxa, sample.names
#Strip Out the ASV matrix and Tree from Our Phyloseq Object
counts = otu_table(ps_rare)
tree = phy_tree(ps_rare)
#We can calculate weighted and unqieghted UNIFRAC like so:
rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)
## And then perform calculations like PERMANOVA and Ordinations on them in VEGAN
soil.s<-as(sample_data(ps_rare),"data.frame") # strip out the sample data
##PERMANOVA - Are Sites DIfferent?
permanova_site = adonis(
formula=rbiom_weighted ~ site.name,
data=soil.s,
permutations=999,
)
permanova_site #58 of variance explained!
##
## Call:
## adonis(formula = rbiom_weighted ~ site.name, data = soil.s, permutations = 999)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## site.name 16 1.2887 0.080545 2.9497 0.59594 0.001 ***
## Residuals 32 0.8738 0.027306 0.40406
## Total 48 2.1625 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Weighted UNIFRAC ordination
soil_mds_weighted = metaMDS(
as.dist(rbiom_weighted)
)
## Run 0 stress 0.1374004
## Run 1 stress 0.1592184
## Run 2 stress 0.1654294
## Run 3 stress 0.1497536
## Run 4 stress 0.1502705
## Run 5 stress 0.1546879
## Run 6 stress 0.1553554
## Run 7 stress 0.1573829
## Run 8 stress 0.1397793
## Run 9 stress 0.1542538
## Run 10 stress 0.137744
## ... Procrustes: rmse 0.01497263 max resid 0.09721065
## Run 11 stress 0.157242
## Run 12 stress 0.137363
## ... New best solution
## ... Procrustes: rmse 0.002410947 max resid 0.01211236
## Run 13 stress 0.1689045
## Run 14 stress 0.1614346
## Run 15 stress 0.150911
## Run 16 stress 0.4021143
## Run 17 stress 0.1378266
## ... Procrustes: rmse 0.01538004 max resid 0.09684462
## Run 18 stress 0.1498033
## Run 19 stress 0.1462731
## Run 20 stress 0.1497546
## *** No convergence -- monoMDS stopping criteria:
## 19: stress ratio > sratmax
## 1: scale factor of the gradient < sfgrmin
#Plot with Vegan
plot(soil_mds_weighted,display="sites",type="n",las=1,ylab="",xaxt="n",yaxt="n",xlab="")
#Add On A Spider linking each point to its population centroid
ordispider(soil_mds_weighted,soil.s$transect.name,col="blue",label=T)
#Add the data points
with(soil.s,points(soil_mds_weighted,display="sites",pch=21,cex=2,bg="white"))
###### More COmplex Model
permanova_site_transect = adonis(
formula=rbiom_weighted ~ transect.name + vegetation + site.name,
data=soil.s,
permutations=999,
)
permanova_site_transect #58 of variance explained!
##
## Call:
## adonis(formula = rbiom_weighted ~ transect.name + vegetation + site.name, data = soil.s, permutations = 999)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## transect.name 1 0.12840 0.12840 4.7023 0.05938 0.001 ***
## vegetation 1 0.35151 0.35151 12.8730 0.16255 0.001 ***
## site.name 14 0.80881 0.05777 2.1157 0.37401 0.001 ***
## Residuals 32 0.87380 0.02731 0.40406
## Total 48 2.16252 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
###### More COmplex Model
permanova_site_transect_noveg = adonis(
formula=rbiom_weighted ~ site.name + transect.name ,
data=soil.s,
permutations=999,
)
permanova_site_transect_noveg
##
## Call:
## adonis(formula = rbiom_weighted ~ site.name + transect.name, data = soil.s, permutations = 999)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## site.name 16 1.2887 0.080545 2.9497 0.59594 0.001 ***
## Residuals 32 0.8738 0.027306 0.40406
## Total 48 2.1625 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we fit NMDS models to microbiome data, we’re making no assumptions about how the data correspond to sample groups of interest. Rather, we ordinate samples in n-dimensional space, corresponding to the number of axes we choose to ordinate to, and then overlay our sample metadata on top to look for patterns.
CCA, or Constrained Correspondence Analysis, does something different. We can fit a model to the ASV / distance matrix and attempt to constrain variation in our data into that explained by our variables of interest. The second step is to ordinate the unconstrained residual information in the data. The advantage is that we get some statistics about our variables of interest alongside the ordination. Let’s repeat some analysis above using CCA
soil_cca<- cca(soil.v ~ transect.name + site.name,data=soil.s)
soil_cca
## Call: cca(formula = soil.v ~ transect.name + site.name, data = soil.s)
##
## Inertia Proportion Rank
## Total 26.0296 1.0000
## Constrained 10.1162 0.3886 16
## Unconstrained 15.9134 0.6114 32
## Inertia is scaled Chi-square
## Some constraints were aliased because they were collinear (redundant)
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8 CCA9 CCA10 CCA11
## 0.9549 0.9008 0.8345 0.7722 0.7169 0.6921 0.6588 0.6333 0.6325 0.6007 0.5319
## CCA12 CCA13 CCA14 CCA15 CCA16
## 0.5228 0.5091 0.4671 0.3594 0.3293
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.8691 0.8466 0.8218 0.7824 0.6930 0.6532 0.6174 0.6063
## (Showing 8 of 32 unconstrained eigenvalues)
We use the ‘Anova’ function in the ‘car’ package, which changes something called the ‘sum of squares’ to makes the results insensitive to the order they’re added. This is super useful, and makes the results more reproducible.
car::Anova(soil_cca)
## Analysis of Deviance Table (Type II tests)
##
## Response: soil.v
## Df Chisq Pr(>Chisq)
## transect.name 1 74.01 < 2.2e-16 ***
## site.name 15 40414.14 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fit a model to test for differences among vegetation types using CCA.
These models also work with continuous predictors. Here we look at the effect of pH. First we fit the model (replacing some missing values with the mean of pH first):
soil.s$ph[which(is.na(soil.s$ph))]<-mean(soil.s$ph,na.rm=T)
soil_env<- cca(soil.v ~ elevation+ ph, data=soil.s)
Then we can inspect the results, as above:
soil_env
## Call: cca(formula = soil.v ~ elevation + ph, data = soil.s)
##
## Inertia Proportion Rank
## Total 26.02958 1.00000
## Constrained 1.56568 0.06015 2
## Unconstrained 24.46390 0.93985 46
## Inertia is scaled Chi-square
##
## Eigenvalues for constrained axes:
## CCA1 CCA2
## 0.8700 0.6957
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.9634 0.8815 0.8576 0.8359 0.8003 0.7890 0.7767 0.7685
## (Showing 8 of 46 unconstrained eigenvalues)
And test for the effects in the model:
car::Anova(soil_env)
## Analysis of Deviance Table (Type II tests)
##
## Response: soil.v
## Df Chisq Pr(>Chisq)
## elevation 1 950.152 < 2.2e-16 ***
## ph 1 17.321 3.157e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(soil_cca)
> Fit a CCA model investigating the effect of Average Soil Relative Humidity on microbial community structure. Compare the results and output here to that from PERMANOVA
The ‘microbiome’ package has some useful functions for investigating beta diversity traits of samples. There’s a nice guide here. Here we’ll calculate ‘divergence’ - a measure of how much variation there is in microbiome structure within, versus among groups of interest. Some groups might all have very similar microbiome profiles (low divergence), and others very distinct profiles. Differences in divergence among groups might tell us if there are population specific processes shaping microbiome structure.
Here I refer to groups as (statistical) populations, but they could be anything. Here we’ll look at microbiome divergence within sites
##################### CALCULATE WITHIN POPULATION DIVERGENCE
#Calculate Within-Pop Divergence
library(microbiome)
#Extract Population Info
popvals<-as.character(unique(sample_data(ps_rare)$transect.name))
popvals
## [1] "Baquedano" "Yungay"
## Make an Empty List to Store Divergence Values
div.list<-rep(list(NA),length(popvals))
for(k in 1:length(popvals)){
#Subset to Reference Group
div.temp<-subset_samples(ps_rare,transect.name==popvals[k])
#Calculate Median for Each ASV
div.reference <- apply(abundances(div.temp), 1, mean) # median often used instead but this dataset is funny
div.list[[k]]<-divergence(div.temp,div.reference,method="bray")
}
#Expand Divergence Values from List to Data Frame
divergence_values<-data.frame(divergence=unlist(div.list),Sample=names(unlist(div.list)))
#Add On Divergence Values
soil.s$Sample<-rownames(soil.s)
soil_divergence<-left_join(soil.s,divergence_values,"Sample")
#Plot Divergence By Pop
ggplot(soil_divergence,aes(x=transect.name,y=divergence)) + geom_violin(aes(fill=transect.name))
Network analysis allows us to visualise co-occurrence of microbial taxa, or similarity of samples based on microbial community profiles. First, let’s make a network of microbial taxa. Normally, we’d only want to present a subset of our microbial taxa so the network doesn’t become too unwieldy - but we’ve already subsetted the data to only taxa with >100 reads for our analysis
ig <- make_network(ps_rare, "taxa", distance = "jaccard", max.dist = 0.95)
plot_network(ig, ps_rare, type="taxa", point_size = 5, label=NULL, color="Class", line_alpha = 0.05)
Next we’ll do the same, but with our samples. The advantage here is that we can overlay important sample metadata over the top of sample points to see if these data predict network similarity / structure. We’ll do that here with the Vegetation variable.
ig <- make_network(ps_rare, "samples", distance = "jaccard", max.dist = 0.95)
plot_network(ig, ps_rare, type="samples", point_size = 5, label=NULL, color="vegetation", line_alpha = 0.05)
data(enterotype)
ig <- make_network(enterotype, max.dist=0.3,distance="bray")
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
Adapt the code above to vary the ‘max.dist’ threshold and see how the network changes at different levels of stringency
############### BETA DIVERISTY PLOT AND TESTS BASED ON CLR-TRANSFORMED DATA OF UNRAREFIED LIBRARY SIZES (XAH)
#### Vegan Function to Convert Phyloseq into Something Vegan can call directly
vegan_otu <- function(physeq) {
OTU <- otu_table(physeq)
if (taxa_are_rows(OTU)) {
OTU <- t(OTU)
}
return(as(OTU, "matrix"))
}
########### Transform Sample Counts
library(microbiome)
#CLR Transform on unrarefied object
ps_clr <- microbiome::transform(physeq100, "clr")
head(otu_table(ps_clr))
## OTU Table: [6 taxa and 54 samples]
## taxa are rows
## BAQ2420.1.1 BAQ2420.1.2 BAQ2420.1.3 BAQ2420.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.2212132 -0.3628818 -0.2650607 -0.3584369
## fce1dd8b549c362b7abf3d12a9baba91 -0.2212132 -0.3628818 -0.2650607 -0.3584369
## e7e19b672a061e119437a7cf815db964 3.1676181 -0.3628818 -0.2650607 -0.3584369
## 5eca5c84925c78b71899493b972dcc65 -0.2212132 -0.3628818 -0.2650607 -0.3584369
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2212132 -0.3628818 -0.2650607 -0.3584369
## bdb152823a97308c1f049837f0cf13db -0.2212132 -0.3628818 -0.2650607 -0.3584369
## BAQ2420.3 BAQ2462.1 BAQ2462.2 BAQ2462.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## fce1dd8b549c362b7abf3d12a9baba91 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## e7e19b672a061e119437a7cf815db964 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## 5eca5c84925c78b71899493b972dcc65 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3375948 -0.3178584 -0.3252743 -0.2947759
## bdb152823a97308c1f049837f0cf13db -0.3375948 -0.3178584 -0.3252743 -0.2947759
## BAQ2687.1 BAQ2687.2 BAQ2687.3 BAQ2838.1
## 9e5c513fb56d18593fe6a31808b722d8 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## fce1dd8b549c362b7abf3d12a9baba91 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## e7e19b672a061e119437a7cf815db964 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## 5eca5c84925c78b71899493b972dcc65 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2658175 -0.2173518 -0.1977687 -0.2473847
## bdb152823a97308c1f049837f0cf13db -0.2658175 -0.2173518 -0.1977687 -0.2473847
## BAQ2838.2 BAQ2838.3 BAQ3473.1 BAQ3473.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## fce1dd8b549c362b7abf3d12a9baba91 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## e7e19b672a061e119437a7cf815db964 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## 5eca5c84925c78b71899493b972dcc65 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2221959 -0.1689595 -0.3084103 -0.3639087
## bdb152823a97308c1f049837f0cf13db -0.2221959 -0.1689595 -0.3084103 -0.3639087
## BAQ3473.3 BAQ4166.1.1 BAQ4166.1.2 BAQ4166.1.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.1665336 -0.2747831 -0.3323028 -0.2839084
## fce1dd8b549c362b7abf3d12a9baba91 -0.1665336 -0.2747831 -0.3323028 -0.2839084
## e7e19b672a061e119437a7cf815db964 -0.1665336 -0.2747831 -0.3323028 -0.2839084
## 5eca5c84925c78b71899493b972dcc65 -0.1665336 -0.2747831 3.4667742 -0.2839084
## c3bd77a0f6486c74b4b713ce8a5f285b 4.8160953 -0.2747831 -0.3323028 -0.2839084
## bdb152823a97308c1f049837f0cf13db -0.1665336 -0.2747831 -0.3323028 -0.2839084
## BAQ4166.2 BAQ4166.3 BAQ4697.1 BAQ4697.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.3082617 -0.3856309 -0.374190 -0.307527
## fce1dd8b549c362b7abf3d12a9baba91 -0.3082617 -0.3856309 -0.374190 -0.307527
## e7e19b672a061e119437a7cf815db964 -0.3082617 -0.3856309 -0.374190 -0.307527
## 5eca5c84925c78b71899493b972dcc65 -0.3082617 -0.3856309 2.050374 -0.307527
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3082617 -0.3856309 -0.374190 -0.307527
## bdb152823a97308c1f049837f0cf13db -0.3082617 -0.3856309 -0.374190 -0.307527
## BAQ4697.3 YUN1005.1.1 YUN1005.3 YUN1242.1
## 9e5c513fb56d18593fe6a31808b722d8 -0.3488606 -0.1893163 -0.2056995 -0.1805398
## fce1dd8b549c362b7abf3d12a9baba91 -0.3488606 -0.1893163 -0.2056995 -0.1805398
## e7e19b672a061e119437a7cf815db964 -0.3488606 -0.1893163 -0.2056995 -0.1805398
## 5eca5c84925c78b71899493b972dcc65 -0.3488606 -0.1893163 -0.2056995 -0.1805398
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3488606 -0.1893163 -0.2056995 -0.1805398
## bdb152823a97308c1f049837f0cf13db -0.3488606 -0.1893163 -0.2056995 -0.1805398
## YUN1242.3 YUN1609.1 YUN2029.2 YUN3153.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## fce1dd8b549c362b7abf3d12a9baba91 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## e7e19b672a061e119437a7cf815db964 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## 5eca5c84925c78b71899493b972dcc65 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## c3bd77a0f6486c74b4b713ce8a5f285b -0.1865694 -0.1983486 -0.2321907 -0.2574349
## bdb152823a97308c1f049837f0cf13db -0.1865694 -0.1983486 -0.2321907 -0.2574349
## YUN3153.3 YUN3259.1.1 YUN3259.1.2 YUN3259.1.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.2610856 -0.0835811 -0.340611 -0.1745068
## fce1dd8b549c362b7abf3d12a9baba91 -0.2610856 -0.0835811 -0.340611 -0.1745068
## e7e19b672a061e119437a7cf815db964 -0.2610856 -0.0835811 -0.340611 -0.1745068
## 5eca5c84925c78b71899493b972dcc65 -0.2610856 -0.0835811 -0.340611 -0.1745068
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2610856 -0.0835811 -0.340611 -0.1745068
## bdb152823a97308c1f049837f0cf13db -0.2610856 -0.0835811 -0.340611 -0.1745068
## YUN3259.2 YUN3259.3 YUN3346.1 YUN3346.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## fce1dd8b549c362b7abf3d12a9baba91 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## e7e19b672a061e119437a7cf815db964 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## 5eca5c84925c78b71899493b972dcc65 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2920226 -0.3296262 -0.3290151 -0.2128194
## bdb152823a97308c1f049837f0cf13db -0.2920226 -0.3296262 -0.3290151 -0.2128194
## YUN3346.3 YUN3428.1 YUN3428.2 YUN3428.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.2893778 -0.3039452 2.474385 3.4726658
## fce1dd8b549c362b7abf3d12a9baba91 -0.2893778 -0.3039452 -0.432200 -0.3082498
## e7e19b672a061e119437a7cf815db964 -0.2893778 3.6273357 -0.432200 -0.3082498
## 5eca5c84925c78b71899493b972dcc65 -0.2893778 -0.3039452 -0.432200 -0.3082498
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2893778 -0.3039452 -0.432200 -0.3082498
## bdb152823a97308c1f049837f0cf13db -0.2893778 2.3195337 2.775787 -0.3082498
## YUN3533.1.1 YUN3533.1.2 YUN3533.1.3 YUN3533.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.1820351 -0.3002894 -0.3051522 -0.3547928
## fce1dd8b549c362b7abf3d12a9baba91 -0.1820351 -0.3002894 -0.3051522 1.8668232
## e7e19b672a061e119437a7cf815db964 -0.1820351 -0.3002894 -0.3051522 2.3967425
## 5eca5c84925c78b71899493b972dcc65 -0.1820351 -0.3002894 -0.3051522 -0.3547928
## c3bd77a0f6486c74b4b713ce8a5f285b -0.1820351 -0.3002894 -0.3051522 -0.3547928
## bdb152823a97308c1f049837f0cf13db -0.1820351 -0.3002894 -0.3051522 2.2101566
## YUN3533.3 YUN3856.1.1 YUN3856.1.2 YUN3856.1.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.3293196 -0.1991167 -0.3602824 -0.3349785
## fce1dd8b549c362b7abf3d12a9baba91 -0.3293196 -0.1991167 2.6967911 -0.3349785
## e7e19b672a061e119437a7cf815db964 -0.3293196 -0.1991167 -0.3602824 -0.3349785
## 5eca5c84925c78b71899493b972dcc65 -0.3293196 -0.1991167 -0.3602824 -0.3349785
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3293196 -0.1991167 -0.3602824 -0.3349785
## bdb152823a97308c1f049837f0cf13db -0.3293196 -0.1991167 -0.3602824 -0.3349785
## YUN3856.2 YUN3856.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.3453369 -0.4737605
## fce1dd8b549c362b7abf3d12a9baba91 -0.3453369 2.3833124
## e7e19b672a061e119437a7cf815db964 -0.3453369 -0.4737605
## 5eca5c84925c78b71899493b972dcc65 -0.3453369 -0.4737605
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3453369 -0.4737605
## bdb152823a97308c1f049837f0cf13db -0.3453369 -0.4737605
#Extract Matrix and Sample Data
ps_clr_v<-vegan_otu(ps_clr)
ps_clr_s<-as(sample_data(ps_clr),"data.frame")
######## Principal Components Analysis
ps_pc<-prcomp(ps_clr_v)
#Cool Biplot Showing How Diff ASVs affect the primary axes of the ordinatiton
biplot(ps_pc)
#Scree plot of relative importance of explained by each axis
plot(ps_pc)
#Variance Explained by FIrst 2 axes
summary(ps_pc)$importance[,1:2] #
## PC1 PC2
## Standard deviation 3.579576 2.817961
## Proportion of Variance 0.119370 0.073970
## Cumulative Proportion 0.119370 0.193340
#### Extract Scores for Plotting
library(vegan)
ps_pc_scores<-scores(ps_pc)
ps_pc_scores_sub<-ps_pc_scores[,1:2]
#Add Sample Data
ps_pc_scores_sub<-cbind(ps_pc_scores_sub,ps_clr_s)
######### Plot
library(ggplot2)
library(cowplot)
#Housekeeping + Label Setup
# ps_pc_scores_sub$Species<-as.factor(ps_pc_scores_sub$Full_names)
#Plot
pca1<-ggplot(ps_pc_scores_sub,aes(x=PC1,y=PC2)) + stat_ellipse(type="t",aes(color=transect.name),level = 0.95,alpha=0.5) + geom_point(aes(colour=transect.name),size=5)
pca2<-pca1 + theme_bw() + labs(fill="Transect Name",x="PC1 (11.7%)",y="PC2 (7.3%)")
pca3<-pca2 + theme(legend.position="top",axis.text=element_text(size=18),axis.title=element_text(size=20),legend.text = element_text(size=12),legend.title = element_text(size=18))
pca3
Here we’ll re-do the CLR transform above, but makign sure that we clean the data adequately to include only complete cases of some variables, and make sure there are no taxa with zero counts. There _shouldn’t be any, because we’re using our stringently-subsetted physeq100 dataset, but it’s good data practice.
############### PERMANOVA ON CLR-Transformed DISTANCES
### Need Complete Cases for Metadata
#Subset ECU to Complete Cases for Breedign Mode (as eg)
physeq_complete<-prune_samples(!is.na(sample_data(physeq100)$transect.name),physeq100)
#Can't Have Zero-Sum taxa for distances
physeq_complete_nonzero<-prune_taxa(taxa_sums(physeq_complete)>0,physeq_complete)
#CLR Transform on unrarefied object
physeq_complete_clr <- microbiome::transform(physeq_complete_nonzero, "clr")
#Extract Matrix and Sample Data
physeq_complete_clr_v<-vegan_otu(physeq_complete_clr)
physeq_complete_clr_s<-as(sample_data(physeq_complete_clr),"data.frame")
## Perform PERMANOVA
ps_clr_perm <- adonis2(physeq_complete_clr_v ~ site.name,data=physeq_complete_clr_s, permutations=999,method="euclidean")
ps_clr_perm
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = physeq_complete_clr_v ~ site.name, data = physeq_complete_clr_s, permutations = 999, method = "euclidean")
## Df SumOfSqs R2 F Pr(>F)
## site.name 16 2885.6 0.50719 2.3799 0.001 ***
## Residual 37 2803.8 0.49281
## Total 53 5689.3 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare the PERMANOVA model above of CLR-transformed data with a ‘standard’ model from the same ‘physeq100’ dataset fitted to rarefied data.
Sometimes we might want to calculate derived quantities of microbiome community composition, like alpha diversity, and use them in more classical statistical models like GLMs or GLMMs. One reason for this is because the data might be inherently structured. In the case of the soil dataset, sampling took place at multiple sites, which themselves were nested across a wider spatial scale of 2 transects, so we might want to account for that in our models.
Here we’ll run some models to look at predictors of alpha diversity in our dataset. We’ll use the code we used to generate use our soil richness data from above, copied below for reference.
soil_richness<-estimate_richness(ps_rare,measures=c("Observed","Shannon","InvSimpson"))
#Add Richness Onto Our Metadata
soil_richness$Sample<-rownames(soil_richness)
soil_richness<-left_join(soil_richness,soil_meta,"Sample")
Now we have everything in one dataframe for fitting models to. Here we’ll fit a model where we try and explain the factors predicting alpha diversity. We will use:
Shannon diversity as the response variable, with a Gaussian error structure
Vegatation interacting with Transect as FIXED effects
Site ID as a RANDOM effect - to account for the fact that there is non-independence among samples taken from the same site
Note that if any of these variables have missing data (NA values), then R will drop them from the model. But for model comparison we need to make sure all models are fitted to the same number of data
library(lme4)
#Complete Dataset
soil_richness_complete<-soil_richness[complete.cases(soil_richness[,c("vegetation","average.soil.relative.humidity","transect.name","site.name")]),]
#Fit Model
m1<-lmer(Shannon ~ average.soil.relative.humidity + vegetation + transect.name + (1|site.name),data=soil_richness_complete)
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## Shannon ~ average.soil.relative.humidity + vegetation + transect.name +
## (1 | site.name)
## Data: soil_richness_complete
##
## REML criterion at convergence: 34
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.60366 -0.52589 0.03536 0.55380 2.40288
##
## Random effects:
## Groups Name Variance Std.Dev.
## site.name (Intercept) 0.002139 0.04625
## Residual 0.084071 0.28995
## Number of obs: 46, groups: site.name, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.460270 0.177941 13.826
## average.soil.relative.humidity 0.008338 0.002691 3.098
## vegetationyes 0.041082 0.146510 0.280
## transect.nameYungay -0.231647 0.095288 -2.431
##
## Correlation of Fixed Effects:
## (Intr) avr... vgttny
## avrg.sl.rl. -0.878
## vegetatinys 0.496 -0.773
## trnsct.nmYn -0.498 0.299 -0.312
m1_glm<-glm(Shannon ~ average.soil.relative.humidity,data=soil_richness_complete)
summary(m1_glm)
##
## Call:
## glm(formula = Shannon ~ average.soil.relative.humidity, data = soil_richness_complete)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.87984 -0.14419 0.01262 0.20905 0.61334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.280601 0.145389 15.686 < 2e-16 ***
## average.soil.relative.humidity 0.009275 0.001750 5.301 3.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.09458495)
##
## Null deviance: 6.8192 on 45 degrees of freedom
## Residual deviance: 4.1617 on 44 degrees of freedom
## AIC: 26.018
##
## Number of Fisher Scoring iterations: 2
We can inspect our model using ‘summary’
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## Shannon ~ average.soil.relative.humidity + vegetation + transect.name +
## (1 | site.name)
## Data: soil_richness_complete
##
## REML criterion at convergence: 34
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.60366 -0.52589 0.03536 0.55380 2.40288
##
## Random effects:
## Groups Name Variance Std.Dev.
## site.name (Intercept) 0.002139 0.04625
## Residual 0.084071 0.28995
## Number of obs: 46, groups: site.name, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.460270 0.177941 13.826
## average.soil.relative.humidity 0.008338 0.002691 3.098
## vegetationyes 0.041082 0.146510 0.280
## transect.nameYungay -0.231647 0.095288 -2.431
##
## Correlation of Fixed Effects:
## (Intr) avr... vgttny
## avrg.sl.rl. -0.878
## vegetatinys 0.496 -0.773
## trnsct.nmYn -0.498 0.299 -0.312
First, we convert our model from REML (REstricted Maximum Likelihood) to ML (Maximum Likelihood) to allow us to compare models with different fixed effects structure. Then we’ll rank our model by AICc to compare model fit whilst also penalising models for their complexity.
Model selection philosophy is a complex issue. We’re using Information Theory here (AIC and related information critera), but other options are available, such as frequentist based p values, or Bayesian models and associated model selection. This paper here should proivde a solid foundation to the various approaches available to using mixed effects models in R.
library(MuMIn)
#Convert model to ML
m1_ml<-update(m1,REML=F)
## boundary (singular) fit: see ?isSingular
summary(m1_ml)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula:
## Shannon ~ average.soil.relative.humidity + vegetation + transect.name +
## (1 | site.name)
## Data: soil_richness_complete
##
## AIC BIC logLik deviance df.resid
## 25.4 36.3 -6.7 13.4 40
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.76877 -0.56083 0.02674 0.58486 2.56802
##
## Random effects:
## Groups Name Variance Std.Dev.
## site.name (Intercept) 0.00000 0.0000
## Residual 0.07828 0.2798
## Number of obs: 46, groups: site.name, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.461706 0.166379 14.796
## average.soil.relative.humidity 0.008288 0.002501 3.314
## vegetationyes 0.043305 0.135200 0.320
## transect.nameYungay -0.230945 0.088105 -2.621
##
## Correlation of Fixed Effects:
## (Intr) avr... vgttny
## avrg.sl.rl. -0.882
## vegetatinys 0.496 -0.769
## trnsct.nmYn -0.489 0.297 -0.318
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
#Set Global Options to Fail in the Case of NAs - done to rpevent models fitted to different amounts of data
options(na.action = "na.fail")
###### P Value
m1_null<-update(m1_ml,~.-average.soil.relative.humidity)
summary(m1_null)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Shannon ~ vegetation + transect.name + (1 | site.name)
## Data: soil_richness_complete
##
## AIC BIC logLik deviance df.resid
## 32.9 42.0 -11.4 22.9 41
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2023 -0.6752 -0.0686 0.4926 2.4172
##
## Random effects:
## Groups Name Variance Std.Dev.
## site.name (Intercept) 0.008903 0.09435
## Residual 0.088297 0.29715
## Number of obs: 46, groups: site.name, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.93765 0.09697 30.295
## vegetationyes 0.40015 0.10478 3.819
## transect.nameYungay -0.32517 0.10346 -3.143
##
## Correlation of Fixed Effects:
## (Intr) vgttny
## vegetatinys -0.592
## trnsct.nmYn -0.544 -0.108
anova(m1_null,m1_ml)
## Data: soil_richness_complete
## Models:
## m1_null: Shannon ~ vegetation + transect.name + (1 | site.name)
## m1_ml: Shannon ~ average.soil.relative.humidity + vegetation + transect.name + (1 | site.name)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m1_null 5 32.875 42.018 -11.438 22.875
## m1_ml 6 25.360 36.332 -6.680 13.360 9.5153 1 0.002038 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Rank by AICc
library(MuMIn)
m1_aic_rank<-dredge(m1_ml)
## Fixed term is "(Intercept)"
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
Deciding Which Models to Use We can insepct our ranked models as follows
m1_aic_rank
## Global model call: lmer(formula = Shannon ~ average.soil.relative.humidity + vegetation +
## transect.name + (1 | site.name), data = soil_richness_complete,
## REML = F)
## ---
## Model selection table
## (Int) avr.sol.rlt.hmd trn.nam vgt df logLik AICc delta weight
## 4 2.435 0.008904 + 5 -6.731 25.0 0.00 0.674
## 8 2.462 0.008288 + + 6 -6.680 27.5 2.55 0.188
## 2 2.272 0.009387 4 -9.903 28.8 3.82 0.100
## 6 2.240 0.010370 + 5 -9.787 31.1 6.11 0.032
## 7 2.938 + + 5 -11.438 34.4 9.41 0.006
## 5 2.737 + 4 -15.321 39.6 14.65 0.000
## 3 3.149 + 4 -16.852 42.7 17.72 0.000
## 1 2.947 3 -18.944 44.5 19.50 0.000
## Models ranked by AICc(x)
## Random terms (all models):
## '1 | site.name'
We can see that the ‘top’ model, with the lowest AICc score, contains only the MAIN effects of Vegetation and Transect ID. There is onyl 1 other model within 6 AIC units of the top model, which is widely regarded as the cutoff one should use.
However, this model is a more complex versions of the top model - they contain more parameters but have a poorer ‘fit to the data ’fit-complexity’ tradeoff. Under the nesting rule, we can ignore these models and not use them for inference, leaving us with just the top model. This makes life much simpler. Note that if we had used p values for simplification, we might have arrived at the same conclusion - where only Relative Humidity is retained for inference.
The R package MuMIn has some handy functions for calculating an analogue of r2 for mixed models
#Re-Fit Top Model
m_top<-lmer(Shannon ~ average.soil.relative.humidity + (1|site.name),data=soil_richness_complete)
summary(m_top)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Shannon ~ average.soil.relative.humidity + (1 | site.name)
## Data: soil_richness_complete
##
## REML criterion at convergence: 34.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.66108 -0.45066 0.02597 0.62491 1.91465
##
## Random effects:
## Groups Name Variance Std.Dev.
## site.name (Intercept) 0.009864 0.09932
## Residual 0.085594 0.29256
## Number of obs: 46, groups: site.name, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.267905 0.152811 14.841
## average.soil.relative.humidity 0.009447 0.001880 5.025
##
## Correlation of Fixed Effects:
## (Intr)
## avrg.sl.rl. -0.943
#Calculate r2
r.squaredGLMM(m_top)
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
## R2m R2c
## [1,] 0.3909167 0.4538547
Marginal r2 (R2m) is our % variance explained just due to the fixed effects. Conditional r2 (R2c) is variance explained by both fixed and random effects. As you can see here, the values are quite high. This is good!
Indicator analysis is based on this paper by Dufrene and Legendre in 1997.
The ideal indicator species: - is exclusively faithful to a particular group (EXCLUSIVITY) - occurs in all sample units within a group(FIDELITY)
Here can use the ‘vegan_otu’ function again to extract the ASV table from the phyloseq object, as it puts the ASvs as columns and samples as rows - this is the format we need for indicator analysis.
library(labdsv)
indicator_matrix<- vegan_otu(ps_rare)
Then we need a numeric vector of cluster identities that correspond to a treatment of interest (labdsv can’t just be given a factor). So let’s see if there are indicator species for each vegetation type. This code will convert the 2 level vegetation factor to numeric clusters where WITH vegetation = 1 and NO vegetation = 2.
veg_numeric<-with(sample_data(ps_rare),ifelse(vegetation=="yes",1,2))
Then we run the indicator analysis by suppling the community matrix and the numeric vector of vegetation types
#library(labdsv)
veg_indic<-indval(indicator_matrix,veg_numeric)
#Inspect Results
summary(veg_indic)
## cluster indicator_value probability
## X6b780e361cfc5f06def718518324bdcb 1 0.6250 0.001
## ef3fdbe1dcde754d91130cde6a4b4d61 1 0.4192 0.017
## X409faa5f5353e543bf6d99125c7c0e83 1 0.3438 0.017
## X5c78314ff92e6fec9aa07acc1fa0dc24 1 0.3438 0.017
## ffd60d684f32e6fd5b47fe90095f9d34 1 0.3008 0.037
## X4c8ff0ea98d2c0ebb486e33bd96c61f9 1 0.2500 0.050
## e81a04a50f341415b6b8dbb1f94c559c 1 0.2500 0.040
## dc8a2f47b3d1dc2e1f5f805891976b29 2 0.6290 0.001
## a7b877ae6d2f079a15b6b192a4425620 2 0.5593 0.002
## a36b38f754f6abd278aeb9dbc7696343 2 0.4605 0.002
## X5eee6273221f15d4e0cdc2c033f0dd72 2 0.3654 0.007
## X6e4d034e4b967b745639f825dc73ebe6 2 0.3529 0.001
## e2b3c5ef478cae4cbabc7577bf139645 2 0.2941 0.003
## X8800bca0dea2ed79f023f3e709b19076 2 0.2353 0.022
## X7452a179b9a24c364642eeaade4d266f 2 0.2353 0.013
## X9279403ab31c9b2574b09cef10f08586 2 0.2353 0.012
## X45e4364ffb37868633b53e176d598b5e 2 0.2353 0.009
## X95f85b31b6d22bec1f34814fa2b4c100 2 0.2353 0.010
## ddbc32632c9632d1dd746f28721f3a9f 2 0.2166 0.027
## X47dc56fdba4dc5de3208a99337000601 2 0.1765 0.038
## X9b86760f1d24e78119093f5f08f45f71 2 0.1765 0.035
## ce6b49bab0851644749e892e745d50db 2 0.1765 0.035
## X43a82022093106139e5e81f88ce5a975 2 0.1765 0.036
## cd5442cc61343511b0de79526804825d 2 0.1765 0.039
## f31d6dfeec66b22ac98f0605d57e7bb4 2 0.1765 0.033
## X3761a26aef7e143cb2cdc05188589703 2 0.1765 0.041
##
## Sum of probabilities = 715.219
##
## Sum of Indicator Values = 58.46
##
## Sum of Significant Indicator Values = 7.82
##
## Number of Significant Indicators = 26
##
## Significant Indicator Distribution
##
## 1 2
## 7 19
This will give us a summary of significant indicator values (with p values), and the cluster to which they ‘belong’. They are arranged by descending indicator value withtin each cluster/vegetation type.
The function also helpfully calculates the relative abundance of individual ‘species’ in each of the two clusters
head(veg_indic$relabu)
## 1 2
## X99fa718995b3129c258bd12bdabb4bec 1 0
## X2dcce6012bdc7a9cb15900ee79beb2ed 1 0
## X8133d2b0b8fcb93ae67ebaaec9667e7b 1 0
## X996b8c93a5184a88d3ec0d3965fe31b2 1 0
## d183373f95d7758e5883ac24de921410 1 0
## X3a51360d9283ba62bef12229ba447c63 1 0
We can be stringent and subset our indicator values to only those above a certain threshold
#Extract The Full Dataframe
veg_indic_output<-data.frame(indval=veg_indic$indcls,pval=veg_indic$pval,cluster=veg_indic$maxcls)
#Add in The Taxon Labels as a column
#R adds an annoying 'X' to the leading edge of hexadecimal IDs that start with numbers, so we need to strip those out using 'gsub'
veg_indic_output$taxon<-gsub("^X","",rownames(veg_indic_output))
#Subset to Indicator Values >0.5
veg_indic_significant<-subset(veg_indic_output,indval>=0.5 & pval <=0.05)
head(veg_indic_significant)
## indval pval cluster
## dc8a2f47b3d1dc2e1f5f805891976b29 0.6289763 0.001 2
## a7b877ae6d2f079a15b6b192a4425620 0.5592630 0.002 2
## X6b780e361cfc5f06def718518324bdcb 0.6250000 0.001 1
## taxon
## dc8a2f47b3d1dc2e1f5f805891976b29 dc8a2f47b3d1dc2e1f5f805891976b29
## a7b877ae6d2f079a15b6b192a4425620 a7b877ae6d2f079a15b6b192a4425620
## X6b780e361cfc5f06def718518324bdcb 6b780e361cfc5f06def718518324bdcb
nrow(veg_indic_significant)
## [1] 3
This is cool, but let’s make it more useful by adding on the taxonomy
#Annotate With Taxonomy
ps_rare_tax<-data.frame(tax_table(ps_rare))
ps_rare_tax$taxon<-rownames(ps_rare_tax)
veg_indic_significant<-left_join(veg_indic_significant,ps_rare_tax,"taxon")
head(veg_indic_significant)
## indval pval cluster taxon Kingdom
## 1 0.6289763 0.001 2 dc8a2f47b3d1dc2e1f5f805891976b29 d__Bacteria
## 2 0.5592630 0.002 2 a7b877ae6d2f079a15b6b192a4425620 d__Bacteria
## 3 0.6250000 0.001 1 6b780e361cfc5f06def718518324bdcb d__Bacteria
## Phylum Class Order Family
## 1 Proteobacteria Gammaproteobacteria Burkholderiales Burkholderiaceae
## 2 Actinobacteriota Actinobacteria Nitriliruptorales Nitriliruptoraceae
## 3 Actinobacteriota Actinobacteria Micrococcales Micrococcaceae
## Genus Species
## 1 Ralstonia <NA>
## 2 Nitriliruptoraceae uncultured_Nitriliruptor
## 3 <NA> <NA>
The pitfalls of differential abundance testing
Critically, we perform these functions on our unrarefied data (physeq) because DESeq2 will fit a model that accounts for the different mean-variance relationship introduced by different library sizes.
Some Convenience Functions for Handling DESeq Data
########## Function To Handle Results Objects and Annotate with Taxonomy
taxo<-function(resultsobject,physeqobject,alpha){
sigtab<-resultsobject[which(resultsobject$padj<alpha),]
sigtab<- cbind(as(sigtab, "data.frame"), as(tax_table(physeqobject)[rownames(sigtab), ], "matrix"))
colnames(sigtab)[7:12]<-c("Kingdom","Phylum","Class","Order","Family","Genus")
return(sigtab)
}
########## Function To Calculate Geometric Means
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
###### Function to parse significant data from DESeq2 results
deseqplot_data<-function(sigtab){
# Phylum order
x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
# Copy Across Genus Labels and Fill in Any Unassigned
sigtab$Genus.long<-as.character(sigtab$Genus)
sigtab$Genus.long[grep("unclassified",sigtab$Genus)] <-paste0("[",as.character(sigtab$Family[grep("unclassified",sigtab$Genus)]),"]")
# Genus order
x = tapply(sigtab$log2FoldChange, sigtab$Genus.long, function(x) max(x))
x = sort(x, TRUE)
sigtab$Genus.long = factor(as.character(sigtab$Genus.long), levels=names(x))
return(sigtab)
}
Fitting Models Using DESeq
Now we can fit models to our data using DESeq as follows. THE Atacama dataset is very data-poor, so stumbles on DESeq2 analyses. So we’ll use a built in dataset here called ‘GlobalPatterns’ which maps global microbial diversity across mutliple sample types. To keep things simple, we’ll only look at two sample types Soil and Ocean
data("GlobalPatterns")
table(sample_data(GlobalPatterns)$SampleType)
##
## Feces Freshwater Freshwater (creek) Mock
## 4 2 3 3
## Ocean Sediment (estuary) Skin Soil
## 3 3 3 3
## Tongue
## 2
gp_soilocean<-subset_samples(GlobalPatterns,SampleType %in% c("Soil","Ocean"))
gp_soilocean
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 6 samples ]
## sample_data() Sample Data: [ 6 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
sample_sums(gp_soilocean)
## CL3 CC1 SV1 NP2 NP3 NP5
## 864077 1135457 697509 523634 1478965 1652754
We use the ‘phyloseq_to_deseq2’ function in phyloseq to port the data in a fromat DESeq2 recognises, whilst also fitting the model we want to examine using the ‘depends on’ ~ symbol. Here we want to look for differential abundance of ASVs based on the 2-level factor of whether Sample Type (Soil or Ocean) (3 samples each)
#Fit The Model We Are Interested in Using The function in phyloseq - gets the data in a format DESeq can read
gpmod<-phyloseq_to_deseq2(gp_soilocean, ~ SampleType)
## converting counts to integer mode
library(DESeq2)
# #Fit the DESeq model
gp.deseq = DESeq(gpmod, test="Wald", fitType="mean")
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
#Inspect The Results Names (comparisons we can make)
resultsNames(gp.deseq)
## [1] "Intercept" "SampleType_Soil_vs_Ocean"
#Extract The Results and Specify the Contrasts We Want - Here 'No Vegetation' will be used as the reference level
gp_results<-results(gp.deseq,contrast=c("SampleType","Soil","Ocean"))
#Annotate The Taxa with the taxonomy
gp_results_taxa<-taxo(gp_results,gp_soilocean,0.0001) #set threshold signfiicance
#### Store The Data in a Format that We Can Plot With GGplot
gp_plot_data<-deseqplot_data(gp_results_taxa)
head(gp_plot_data)
## baseMean log2FoldChange lfcSE stat pvalue padj
## 250392 12252.038 -13.45182 2.143351 -6.276071 3.472363e-10 2.719767e-07
## 12812 41888.526 -13.34543 2.087351 -6.393473 1.621597e-10 2.035915e-07
## 238123 3424.998 -12.66233 2.262418 -5.596816 2.183247e-08 3.461811e-06
## 535929 8721.573 -13.02494 2.299513 -5.664214 1.477000e-08 2.584240e-06
## 137837 2417.732 -13.22169 2.550415 -5.184131 2.170245e-07 1.940941e-05
## 332808 13807.033 -13.98571 2.361268 -5.922966 3.161868e-09 9.437245e-07
## Kingdom Phylum Class Order Family Genus
## 250392 Archaea Euryarchaeota Thermoplasmata E2 MarinegroupII <NA>
## 12812 Bacteria Actinobacteria Actinobacteria koll13 <NA> <NA>
## 238123 Bacteria Cyanobacteria Chloroplast Cryptophyta <NA> <NA>
## 535929 Bacteria Cyanobacteria Chloroplast Cryptophyta <NA> <NA>
## 137837 Bacteria Cyanobacteria Chloroplast Stramenopiles <NA> <NA>
## 332808 Bacteria Cyanobacteria Chloroplast Stramenopiles <NA> <NA>
## Species Genus.long
## 250392 <NA> <NA>
## 12812 <NA> <NA>
## 238123 <NA> <NA>
## 535929 <NA> <NA>
## 137837 <NA> <NA>
## 332808 <NA> <NA>
#####Generate Plot
gp_plot<-ggplot(gp_plot_data,aes(x=Genus.long, y=log2FoldChange)) + geom_point(shape=21,size=6,aes(fill=Phylum)) + theme_classic() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5,size=12)) + scale_fill_brewer(palette="Set2")+geom_hline(yintercept = 0,linetype="dashed") + labs(x="Genus") + scale_shape_manual(values=c(21,22,23,24,25))
gp_plot
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Some people are suspicious of methods like DESeq2, because they don’t account for the compositional nature of the data that seqeuncers produce. There’s some evidence that the method is overly sensitive, producing too many false positives. Luckily we have other options - we can use a modelling framework like ALDEX2 that automatically accounts for the compositional nature of our data.
Here we’ll replicate the model above we ran for DESEq2. Note the ALDEX2 model uses the raw data and does the CLR transformation for us. So we just need to give it our subsetted Global Patterns object.
library(ALDEx2)
#Strip Out Matrix using functiion above
gp_soilocean_v<-t(vegan_otu(gp_soilocean))
#Strip Out a vector of covariates from the phyloseq object (needs to be two levels)
gp_soilocean_sample<-sample_data(gp_soilocean)$SampleType
#Run ALDEX model
gp_soilocean_aldexmodel<- aldex(gp_soilocean_v, gp_soilocean_sample, mc.samples=128, test="t", effect=TRUE,
include.sample.summary=TRUE, denom="all", verbose=FALSE)
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables
## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables
## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables
## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables
## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables
## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables
head(gp_soilocean_aldexmodel)
## rab.all rab.win.Ocean rab.win.Soil rab.sample.CL3 rab.sample.CC1
## 549322 -1.4030191 0.1697115 -2.7437539 -2.443619 -3.371567
## 143239 -0.4110687 -0.5411641 -0.1762045 -2.443619 -3.371567
## 255340 0.5434051 -0.4389759 6.4850145 -2.443619 -3.371567
## 144887 0.2985363 -0.6543108 0.8125212 -2.443619 -3.371567
## 141782 0.5070370 -0.6792306 2.5683251 -2.443619 -3.371567
## 215972 -0.7550722 -0.7264933 -0.8145584 -2.443619 -3.371567
## rab.sample.SV1 rab.sample.NP2 rab.sample.NP3 rab.sample.NP5 diff.btw
## 549322 -2.382425 1.721916 -1.320857 -0.3700944 -2.8959228
## 143239 -2.382425 1.721916 -1.320857 -0.3700944 0.9597222
## 255340 -2.382425 1.721916 -1.320857 -0.3700944 5.8325368
## 144887 -2.382425 1.721916 -1.320857 -0.3700944 1.1151942
## 141782 -2.382425 1.721916 -1.320857 -0.3700944 3.0308823
## 215972 -2.382425 1.721916 -1.320857 -0.3700944 -0.1282677
## diff.win effect overlap we.ep we.eBH wi.ep wi.eBH
## 549322 4.376101 -0.5798190 0.2083343 0.3453243 0.5607102 0.4335938 0.6016757
## 143239 3.964454 0.2101402 0.4010419 0.5344543 0.7102889 0.6687500 0.7830871
## 255340 5.915275 0.8761477 0.2227988 0.3099057 0.5445720 0.4351562 0.6286235
## 144887 3.610967 0.2719419 0.3782386 0.4964911 0.6818479 0.6000000 0.7397385
## 141782 4.208694 0.6438661 0.2176175 0.4096025 0.6146385 0.4453125 0.6314579
## 215972 4.547052 -0.0185348 0.4870467 0.5328614 0.7096396 0.6367188 0.7662457
#Plot (these are built in plotting functions
aldex.plot(gp_soilocean_aldexmodel, type="MW", test="welch", xlab="Dispersion",ylab="Difference")
aldex.plot(gp_soilocean_aldexmodel, type="MA", test="welch", xlab="Log-ratio abundance",ylab="Difference")
######## CONVENIENCE PLOTTING FUNCTION
#Here's a function I made that will take your aldex model object, and the inout phyloseq object, and annotate the ASVs with their taxonomoy for plotting
###Taxonomy Annotation Function
taxo_annotate<-function(aldex_obj,physeq_in){
library(dplyr)
##Annotate Taxonomy
aldex_obj$ASV<-rownames(aldex_obj)
physeq_tax<-data.frame(tax_table(physeq_in))
physeq_tax$ASV<-rownames(physeq_tax)
aldex_out<-left_join(aldex_obj,physeq_tax,"ASV")
##Annotate Significance
aldex_out$signif<-factor(with(aldex_out,ifelse(wi.eBH < 0.01 & effect > 1 | wi.eBH < 0.01 & effect < -1,"Significant","NS")))
return(aldex_out)
}
### Run Function on Rnew Results
gp_aldex_plotdata<-taxo_annotate(gp_soilocean_aldexmodel,gp_soilocean)
## How Many Sig. DIff ASVs?
with(gp_aldex_plotdata,table(wi.eBH<0.05))
##
## FALSE
## 12852
###Plot
library(ggplot2)
rnew_plot<-ggplot(gp_aldex_plotdata,aes(x=diff.win,y=diff.btw)) + geom_point(aes(shape=signif,fill=Phylum)) + theme_bw() + geom_abline(intercept=0,slope=1,lty=2) + geom_abline(intercept=0,slope=-1,lty=2) + scale_shape_manual(values=c(21,22))
rnew_plot + guides(fill = guide_legend(override.aes=list(shape=21,size=5)))
This guide has hopefully equipped you with a broad skillset of analytical tools for tackling microbiome data. Now we can apply those tools to new data to generate insights into the factors shaping host microbial communities.
#Load Human Microbiome Project Data
library(MicrobeDS)
data("HMPv35")
#Inspect Phyloseq Object - 6000 samples
HMPv35
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10730 taxa and 6000 samples ]
## sample_data() Sample Data: [ 6000 samples by 33 sample variables ]
## tax_table() Taxonomy Table: [ 10730 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 10730 tips and 10729 internal nodes ]
The goal of this task is to generate report on variation in human microbiome diversity and structure based on body site. Use all of the tools present in this workflow.
This dataset contains samples from four major body sites. But remember in your analyses that these data are STRUCTURED. Multiple measuremenmts from the same individual, individuals arising from different areas including urban and rural. There are all things you might want to consider in your modelling.
table(sample_data(HMPv35)$env)
##
## Oral Skin Stool Vaginal
## 3289 1881 388 442
ANALYSIS GOALS - 1) How do the body sites differ in richness? (alpha diversity metrics, models of alpha diversity) - 2) How do the body sites differ in structure (heatmaps, ordinations) - 3) How do the answers in (3) change if we use CLR-transform on the data instead? - 4) What ASV’s are differentially abundant between body sites
Note you may want to subset the data to a core set of ASV’s first to make analyses easier.
#END
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] rbiom_1.0.2 car_3.0-11
## [3] carData_3.0-4 MuMIn_1.43.17
## [5] lme4_1.1-27.1 Matrix_1.3-4
## [7] MicrobeDS_0.1.0 forcats_0.5.1
## [9] stringr_1.4.0 dplyr_1.0.7
## [11] purrr_0.3.4 readr_2.0.1
## [13] tidyr_1.1.3 tibble_3.1.4
## [15] tidyverse_1.3.1 pheatmap_1.0.12
## [17] cowplot_1.1.1 gplots_3.1.1
## [19] RColorBrewer_1.1-2 ALDEx2_1.24.0
## [21] Rfast_2.0.3 RcppZiggurat_0.1.6
## [23] Rcpp_1.0.7 zCompositions_1.3.4
## [25] truncnorm_1.0-8 NADA_1.6-1.1
## [27] survival_3.2-13 MASS_7.3-54
## [29] labdsv_2.0-1 mgcv_1.8-37
## [31] nlme_3.1-153 DESeq2_1.32.0
## [33] SummarizedExperiment_1.22.0 Biobase_2.52.0
## [35] MatrixGenerics_1.4.3 matrixStats_0.61.0
## [37] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
## [39] IRanges_2.26.0 S4Vectors_0.30.0
## [41] BiocGenerics_0.38.0 qiime2R_0.99.6
## [43] microbiome_1.14.0 ggplot2_3.3.5
## [45] vegan_2.5-7 lattice_0.20-45
## [47] permute_0.9-5 phyloseq_1.36.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 Hmisc_4.5-0
## [4] plyr_1.8.6 igraph_1.2.6 splines_4.1.1
## [7] BiocParallel_1.26.2 digest_0.6.28 foreach_1.5.1
## [10] htmltools_0.5.2 fansi_0.5.0 magrittr_2.0.1
## [13] checkmate_2.0.0 memoise_2.0.0 cluster_2.1.2
## [16] openxlsx_4.2.4 tzdb_0.1.2 Biostrings_2.60.2
## [19] annotate_1.70.0 modelr_0.1.8 RcppParallel_5.1.4
## [22] jpeg_0.1-9 colorspace_2.0-2 rvest_1.0.1
## [25] blob_1.2.2 haven_2.4.3 xfun_0.26
## [28] crayon_1.4.1 RCurl_1.98-1.5 jsonlite_1.7.2
## [31] genefilter_1.74.0 iterators_1.0.13 ape_5.5
## [34] glue_1.4.2 gtable_0.3.0 zlibbioc_1.38.0
## [37] XVector_0.32.0 DelayedArray_0.18.0 Rhdf5lib_1.14.2
## [40] abind_1.4-5 scales_1.1.1 DBI_1.1.1
## [43] xtable_1.8-4 htmlTable_2.2.1 foreign_0.8-81
## [46] bit_4.0.4 Formula_1.2-4 DT_0.19
## [49] htmlwidgets_1.5.4 httr_1.4.2 ellipsis_0.3.2
## [52] farver_2.1.0 pkgconfig_2.0.3 XML_3.99-0.8
## [55] dbplyr_2.1.1 nnet_7.3-16 sass_0.4.0
## [58] locfit_1.5-9.4 utf8_1.2.2 labeling_0.4.2
## [61] tidyselect_1.1.1 rlang_0.4.11 reshape2_1.4.4
## [64] AnnotationDbi_1.54.1 cellranger_1.1.0 munsell_0.5.0
## [67] tools_4.1.1 cachem_1.0.6 cli_3.0.1
## [70] generics_0.1.0 RSQLite_2.2.8 ade4_1.7-18
## [73] broom_0.7.9 evaluate_0.14 biomformat_1.20.0
## [76] fastmap_1.1.0 yaml_2.2.1 fs_1.5.0
## [79] knitr_1.34 bit64_4.0.5 zip_2.2.0
## [82] caTools_1.18.2 KEGGREST_1.32.0 slam_0.1-48
## [85] xml2_1.3.2 compiler_4.1.1 rstudioapi_0.13
## [88] curl_4.3.2 png_0.1-7 reprex_2.0.1
## [91] geneplotter_1.70.0 bslib_0.3.0 stringi_1.7.4
## [94] highr_0.9 nloptr_1.2.2.2 multtest_2.48.0
## [97] vctrs_0.3.8 pillar_1.6.3 lifecycle_1.0.1
## [100] rhdf5filters_1.4.0 jquerylib_0.1.4 data.table_1.14.0
## [103] bitops_1.0-7 R6_2.5.1 latticeExtra_0.6-29
## [106] rio_0.5.27 KernSmooth_2.23-20 gridExtra_2.3
## [109] codetools_0.2-18 boot_1.3-28 gtools_3.9.2
## [112] assertthat_0.2.1 rhdf5_2.36.0 withr_2.4.2
## [115] GenomeInfoDbData_1.2.6 hms_1.1.1 grid_4.1.1
## [118] rpart_4.1-15 minqa_1.2.4 rmarkdown_2.11
## [121] Rtsne_0.15 lubridate_1.7.10 base64enc_0.1-3